mm/mempolicy: add set_mempolicy_home_node syscall
This syscall sets a home node for a range that uses the MPOL_BIND or
MPOL_PREFERRED_MANY memory policy.  Users should call it after setting
up a memory policy for the specified range, as shown below.

  mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp,
        new_nodes->size + 1, 0);
  sys_set_mempolicy_home_node((unsigned long)p, nr_pages * page_size,
				home_node, 0);

The syscall allows specifying a home node/preferred node from which
the kernel will try to fulfill memory allocation requests first.
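
The call above uses the kernel-internal name. From userspace, and assuming no libc wrapper is available, it would likely go through syscall(2). The following is a minimal sketch, not the author's code: the wrapper name set_home_node is hypothetical, and the fallback number 450 is an assumption (it is the value later wired up for x86-64 and the generic syscall table; this commit itself changes no syscall table). Prefer the definition from the installed kernel headers when it exists.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Assumption: 450 is only a fallback for headers that predate the
 * syscall. Some architectures use a different number.
 */
#ifndef __NR_set_mempolicy_home_node
#define __NR_set_mempolicy_home_node 450
#endif

/* The kernel takes the start address as an unsigned long. */
static_assert(sizeof(unsigned long) >= sizeof(void *),
	      "address must fit in unsigned long");

/*
 * Hypothetical wrapper (name chosen for this example).
 * Returns 0 on success, or -1 with errno set, as syscall(2) does.
 */
static long set_home_node(void *start, size_t len,
			  unsigned long home_node, unsigned long flags)
{
	return syscall(__NR_set_mempolicy_home_node,
		       (unsigned long)start, (unsigned long)len,
		       home_node, flags);
}
```

A caller would first apply mbind() to the range and then call set_home_node(p, nr_pages * page_size, home_node, 0). On a kernel without the syscall the wrapper fails with ENOSYS, so callers may want to treat that as "no home node support" rather than a hard error.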

For an address range with the MPOL_BIND memory policy, if the nodemask
specifies more than one node, page allocations will come from the node
in the nodemask with sufficient free memory that is closest to the home
node/preferred node.

For MPOL_PREFERRED_MANY, if the nodemask specifies more than one node,
page allocations will come from the node in the nodemask with sufficient
free memory that is closest to the home node/preferred node.  If none of
the nodes in the nodemask has enough memory, the allocation will be
attempted from the NUMA node in the system that is closest to the home
node.

This lets applications hint at a preferred node for memory allocation
and fall back to _only_ a given set of nodes if memory is not available
on the preferred node.  Fallback allocation is attempted from the node
nearest to the preferred node.

This gives applications control over which NUMA nodes memory is
allocated from and avoids the default fallback to slow memory NUMA
nodes.  For example, consider a system with DRAM on NUMA nodes 1, 2 and
3 and slow memory on nodes 10, 11 and 12:

 new_nodes = numa_bitmask_alloc(nr_nodes);

 numa_bitmask_setbit(new_nodes, 1);
 numa_bitmask_setbit(new_nodes, 2);
 numa_bitmask_setbit(new_nodes, 3);

 p = mmap(NULL, nr_pages * page_size, protflag, mapflag, -1, 0);
 mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp, new_nodes->size + 1, 0);

 sys_set_mempolicy_home_node((unsigned long)p, nr_pages * page_size, 2, 0);

This will allocate from the nodes closest to node 2 and ensures that the
kernel only allocates from nodes 1, 2, and 3.  Memory will not be
allocated from slow memory nodes 10, 11, and 12.  This differs from
default MPOL_BIND behavior, where the allocation is attempted from the
node closest to the local node.  One of the reasons to specify a home
node is to allow allocations from a CPU-less NUMA node and its nearby
NUMA nodes.

MPOL_PREFERRED_MANY, on the other hand, will first try to allocate from
whichever of nodes 1, 2 and 3 is closest to node 2.  If those nodes
don't have enough memory, the kernel will allocate from whichever of the
slow memory nodes 10, 11 and 12 is closest to node 2.
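
The difference between the two policies can be illustrated with a toy userspace model. This is a sketch under stated assumptions, not the kernel's allocator: the struct, the function toy_pick_node and the distance values a caller supplies are all hypothetical, and the real kernel walks zonelists rather than a flat array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_mode { TOY_BIND, TOY_PREFERRED_MANY };

/* Hypothetical description of one NUMA node as seen from the home node. */
struct toy_node {
	int id;
	int distance_to_home;	/* made-up distance, smaller is closer */
	bool in_nodemask;
	bool has_free_memory;
};

/*
 * Return the id of the node an allocation would come from in this toy
 * model, or -1 if the allocation would fail.
 */
static int toy_pick_node(const struct toy_node *nodes, size_t n,
			 enum toy_mode mode)
{
	/*
	 * BIND never leaves the nodemask. PREFERRED_MANY gets a second
	 * pass over every node in the system.
	 */
	int passes = (mode == TOY_PREFERRED_MANY) ? 2 : 1;

	for (int pass = 0; pass < passes; pass++) {
		int best = -1, best_dist = 0;

		for (size_t i = 0; i < n; i++) {
			assert(nodes[i].id >= 0);
			if (!nodes[i].has_free_memory)
				continue;
			if (pass == 0 && !nodes[i].in_nodemask)
				continue;
			if (best < 0 || nodes[i].distance_to_home < best_dist) {
				best = nodes[i].id;
				best_dist = nodes[i].distance_to_home;
			}
		}
		if (best >= 0)
			return best;
	}
	return -1;
}
```

With nodes 1, 2 and 3 in the nodemask and home node 2, the model picks node 2 while it has free memory and then the nearer of nodes 1 and 3. Once all three are full, TOY_BIND fails, while TOY_PREFERRED_MANY moves on to the nearest of nodes 10, 11 and 12.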

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Cc: Ben Widawsky <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Feng Tang <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
kvaneesh authored and torvalds committed Jan 15, 2022
1 parent c045511 commit c6018b4
Showing 3 changed files with 95 additions and 1 deletion.
16 changes: 15 additions & 1 deletion Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -408,7 +408,7 @@ follows:
Memory Policy APIs
==================

-Linux supports 3 system calls for controlling memory policy. These APIS
+Linux supports 4 system calls for controlling memory policy. These APIS
always affect only the calling task, the calling task's address space, or
some shared object mapped into the calling task's address space.

@@ -460,6 +460,20 @@ requested via the 'flags' argument.

See the mbind(2) man page for more details.

Set home node for a Range of Task's Address Spacec::

	long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
					 unsigned long home_node,
					 unsigned long flags);

sys_set_mempolicy_home_node set the home node for a VMA policy present in the
task's address range. The system call updates the home node only for the existing
mempolicy range. Other address ranges are ignored. A home node is the NUMA node
closest to which page allocation will come from. Specifying the home node override
the default allocation policy to allocate memory close to the local node for an
executing CPU.


Memory Policy Command Line Interface
====================================

1 change: 1 addition & 0 deletions include/linux/mempolicy.h
@@ -46,6 +46,7 @@ struct mempolicy {
	unsigned short mode;	/* See MPOL_* above */
	unsigned short flags;	/* See set_mempolicy() MPOL_F_* above */
	nodemask_t nodes;	/* interleave/bind/perfer */
	int home_node;		/* Home node to use for MPOL_BIND and MPOL_PREFERRED_MANY */

	union {
		nodemask_t cpuset_mems_allowed;	/* relative to these nodes */
79 changes: 79 additions & 0 deletions mm/mempolicy.c
@@ -296,6 +296,7 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
	atomic_set(&policy->refcnt, 1);
	policy->mode = mode;
	policy->flags = flags;
	policy->home_node = NUMA_NO_NODE;

	return policy;
}
@@ -1478,6 +1479,77 @@ static long kernel_mbind(unsigned long start, unsigned long len,
	return do_mbind(start, len, lmode, mode_flags, &nodes, flags);
}

SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, len,
		unsigned long, home_node, unsigned long, flags)
{
	struct mm_struct *mm = current->mm;
	struct vm_area_struct *vma;
	struct mempolicy *new;
	unsigned long vmstart;
	unsigned long vmend;
	unsigned long end;
	int err = -ENOENT;

	start = untagged_addr(start);
	if (start & ~PAGE_MASK)
		return -EINVAL;
	/*
	 * flags is used for future extension if any.
	 */
	if (flags != 0)
		return -EINVAL;

	/*
	 * Check home_node is online to avoid accessing uninitialized
	 * NODE_DATA.
	 */
	if (home_node >= MAX_NUMNODES || !node_online(home_node))
		return -EINVAL;

	len = (len + PAGE_SIZE - 1) & PAGE_MASK;
	end = start + len;

	if (end < start)
		return -EINVAL;
	if (end == start)
		return 0;
	mmap_write_lock(mm);
	vma = find_vma(mm, start);
	for (; vma && vma->vm_start < end;  vma = vma->vm_next) {

		vmstart = max(start, vma->vm_start);
		vmend   = min(end, vma->vm_end);
		new = mpol_dup(vma_policy(vma));
		if (IS_ERR(new)) {
			err = PTR_ERR(new);
			break;
		}
		/*
		 * Only update home node if there is an existing vma policy
		 */
		if (!new)
			continue;

		/*
		 * If any vma in the range got policy other than MPOL_BIND
		 * or MPOL_PREFERRED_MANY we return error. We don't reset
		 * the home node for vmas we already updated before.
		 */
		if (new->mode != MPOL_BIND && new->mode != MPOL_PREFERRED_MANY) {
			err = -EOPNOTSUPP;
			break;
		}

		new->home_node = home_node;
		err = mbind_range(mm, vmstart, vmend, new);
		mpol_put(new);
		if (err)
			break;
	}
	mmap_write_unlock(mm);
	return err;
}
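
The argument checks at the top of the syscall can be mirrored in a small userspace model, which may help when reasoning about which inputs are rejected with -EINVAL before any VMA is touched. This is only a sketch: toy_check_range and the fixed 4096-byte page size are assumptions made for the example, and the home_node online check and untagged_addr() step are left out because they have no userspace equivalent here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Assumption: 4 KiB pages. The kernel uses the architecture's PAGE_SIZE. */
#define TOY_PAGE_SIZE 4096UL
#define TOY_PAGE_MASK (~(TOY_PAGE_SIZE - 1))

/*
 * Hypothetical helper mirroring the range checks in the syscall.
 * On success returns 0 and stores the page-rounded end address in *end.
 * Returns -EINVAL for an unaligned start, non-zero flags, or a range
 * that wraps around the address space.
 */
static int toy_check_range(unsigned long start, unsigned long len,
			   unsigned long flags, unsigned long *end)
{
	unsigned long range_end;

	assert(end != NULL);

	if (start & ~TOY_PAGE_MASK)
		return -EINVAL;
	if (flags != 0)
		return -EINVAL;

	/* Round the length up to a whole number of pages. */
	len = (len + TOY_PAGE_SIZE - 1) & TOY_PAGE_MASK;
	range_end = start + len;

	if (range_end < start)
		return -EINVAL;

	*end = range_end;
	return 0;
}
```

As in the syscall, a zero-length range is accepted: the helper returns 0 with *end equal to start, which corresponds to the early "return 0" in the kernel code.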

SYSCALL_DEFINE6(mbind, unsigned long, start, unsigned long, len,
unsigned long, mode, const unsigned long __user *, nmask,
unsigned long, maxnode, unsigned int, flags)
@@ -1802,6 +1874,11 @@ static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
		WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
	}

	if ((policy->mode == MPOL_BIND ||
		policy->mode == MPOL_PREFERRED_MANY) &&
		policy->home_node != NUMA_NO_NODE)
		return policy->home_node;

	return nd;
}
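
The effect of the added check in policy_node() can be restated as a standalone function. The sketch below models only this hunk: the types and names (sketch_policy, sketch_policy_node, SK_NO_NODE) are hypothetical stand-ins for the kernel's, and nd represents whatever node the earlier part of policy_node() had already selected.

```c
#include <assert.h>
#include <stddef.h>

#define SK_NO_NODE (-1)	/* stands in for NUMA_NO_NODE */

enum sketch_mode {
	SK_DEFAULT,
	SK_PREFERRED,
	SK_BIND,
	SK_PREFERRED_MANY,
};

/* Hypothetical, reduced version of struct mempolicy. */
struct sketch_policy {
	enum sketch_mode mode;
	int home_node;
};

/*
 * Return the node an allocation should start from: the home node when
 * one is set on a BIND or PREFERRED_MANY policy, otherwise nd.
 */
static int sketch_policy_node(const struct sketch_policy *pol, int nd)
{
	assert(pol != NULL);

	if ((pol->mode == SK_BIND || pol->mode == SK_PREFERRED_MANY) &&
	    pol->home_node != SK_NO_NODE)
		return pol->home_node;

	return nd;
}
```

Because mpol_new() initializes home_node to NUMA_NO_NODE, a policy that never had a home node set behaves exactly as before in this model: the function simply returns nd.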

@@ -2344,6 +2421,8 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
		return false;
	if (a->flags != b->flags)
		return false;
	if (a->home_node != b->home_node)
		return false;
	if (mpol_store_user_nodemask(a))
		if (!nodes_equal(a->w.user_nodemask, b->w.user_nodemask))
			return false;