mempolicy: use MPOL_PREFERRED for system-wide default policy
Currently, when one specifies MPOL_DEFAULT via a NUMA memory policy API
[set_mempolicy(), mbind() and internal versions], the kernel simply installs a
NULL struct mempolicy pointer in the appropriate context: task policy, vma
policy, or shared policy.  This causes any use of that policy to "fall back"
to the next most specific policy scope.

The only use of MPOL_DEFAULT to mean "local allocation" is in the system
default policy.  This requires extra checks/cases for MPOL_DEFAULT in many
mempolicy.c functions.

There is another, "preferred" way to specify local allocation via the APIs.
That is using the MPOL_PREFERRED policy mode with an empty nodemask.
Internally, the empty nodemask gets converted to a preferred_node id of '-1'.
All internal usage of MPOL_PREFERRED will convert the '-1' to the id of the
node local to the cpu where the allocation occurs.
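
The '-1' convention described above can be sketched in user space as follows. This is a minimal illustration, not kernel code: `MODEL_LOCAL_NODE` and `model_preferred_target()` are hypothetical names, and the local node id is passed in as a parameter, where the kernel would call numa_node_id().

```c
#include <assert.h>

/* Hypothetical name for the '-1' value meaning "no explicit node". */
#define MODEL_LOCAL_NODE (-1)

/*
 * Return the node from which an allocation is first attempted.
 * A negative preferred_node stands for "the node local to the cpu
 * where the allocation occurs", supplied here as local_node.
 */
static int model_preferred_target(int preferred_node, int local_node)
{
	if (preferred_node < 0)
		return local_node;
	return preferred_node;
}
```

With this model, `model_preferred_target(MODEL_LOCAL_NODE, 3)` resolves to node 3, while an explicit preferred node is returned unchanged.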

System default policy, except during boot, is hard-coded to "local
allocation".  By using the MPOL_PREFERRED mode with a negative value of
preferred node for system default policy, MPOL_DEFAULT will never occur in the
'policy' member of a struct mempolicy.  Thus, we can remove all checks for
MPOL_DEFAULT when converting policy to a node id/zonelist in the allocation
paths.
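
A rough sketch of the resulting scope fallback, under the assumption that a NULL pointer at a scope means "no policy installed here". All `sketch_*` names are hypothetical stand-ins for the kernel structures, and the mode enum deliberately has no "default" member, mirroring the claim that MPOL_DEFAULT never appears in a policy structure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the policy modes; note there is no DEFAULT. */
enum sketch_mode { SKETCH_PREFERRED, SKETCH_BIND, SKETCH_INTERLEAVE };

struct sketch_policy {
	enum sketch_mode mode;
	int preferred_node;	/* only meaningful for SKETCH_PREFERRED */
};

/* System default: preferred mode with a negative node => local allocation. */
static const struct sketch_policy sketch_system_default = {
	.mode = SKETCH_PREFERRED,
	.preferred_node = -1,
};

/*
 * Pick the most specific policy that is actually installed:
 * vma policy first, then task policy, then the system default.
 */
static const struct sketch_policy *
sketch_effective_policy(const struct sketch_policy *vma_pol,
			const struct sketch_policy *task_pol)
{
	if (vma_pol)
		return vma_pol;
	if (task_pol)
		return task_pol;
	return &sketch_system_default;
}
```

Because the last fallback is always a real policy object, code that consumes the result never has to handle a "default" mode.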

In slab_node() return local node id when policy pointer is NULL.  No need to
set a pol value to take the switch default.  Replace switch default with
BUG()--i.e., shouldn't happen.

With this patch MPOL_DEFAULT is only used in the APIs, including internal
calls to do_set_mempolicy() and in the display of policy in
/proc/<pid>/numa_maps.  It always means "fall back" to the next most
specific policy scope.  This simplifies the description of memory policies
quite a bit, with no visible change in behavior.

get_mempolicy() continues to return MPOL_DEFAULT and an empty nodemask when
the requested policy [task or vma/shared] is NULL.  These are the values one
would supply via set_mempolicy() or mbind() to achieve that condition--default
behavior.
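
The query behavior described above can be modeled as below. This is a simplified user-space sketch: the `report_*` names are hypothetical, and the nodemask is reduced to a single unsigned long bitmask rather than the kernel's nodemask_t.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical modes as reported to the caller; DEFAULT exists only here. */
enum report_mode {
	REPORT_DEFAULT,
	REPORT_PREFERRED,
	REPORT_BIND,
	REPORT_INTERLEAVE,
};

struct report_policy {
	enum report_mode mode;
	unsigned long nodes;	/* bit n set => node n is in the mask */
};

/*
 * Report the policy at one scope without falling back to a less
 * specific scope: a NULL policy yields the default mode and an
 * empty nodemask.
 */
static void report_query(const struct report_policy *pol,
			 enum report_mode *mode, unsigned long *nodes)
{
	if (!pol) {
		*mode = REPORT_DEFAULT;
		*nodes = 0;
		return;
	}
	*mode = pol->mode;
	*nodes = pol->nodes;
}
```

The values reported for a NULL policy are the same ones a caller would pass back in to remove a policy, so a query followed by a set reproduces the queried state.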

This patch updates Documentation to reflect this change.

Signed-off-by: Lee Schermerhorn <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Andi Kleen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Lee Schermerhorn authored and torvalds committed Apr 28, 2008
1 parent 52cd3b0 commit bea904d
Showing 2 changed files with 60 additions and 62 deletions.
54 changes: 18 additions & 36 deletions Documentation/vm/numa_memory_policy.txt
@@ -147,35 +147,18 @@ Components of Memory Policies

Linux memory policy supports the following 4 behavioral modes:

Default Mode--MPOL_DEFAULT: The behavior specified by this mode is
context or scope dependent.

As mentioned in the Policy Scope section above, during normal
system operation, the System Default Policy is hard coded to
contain the Default mode.

In this context, default mode means "local" allocation--that is
attempt to allocate the page from the node associated with the cpu
where the fault occurs. If the "local" node has no memory, or the
node's memory can be exhausted [no free pages available], local
allocation will "fallback to"--attempt to allocate pages from--
"nearby" nodes, in order of increasing "distance".

Implementation detail -- subject to change: "Fallback" uses
a per node list of sibling nodes--called zonelists--built at
boot time, or when nodes or memory are added or removed from
the system [memory hotplug]. These per node zonelist are
constructed with nodes in order of increasing distance based
on information provided by the platform firmware.

When a task/process policy or a shared policy contains the Default
mode, this also means "local allocation", as described above.

In the context of a VMA, Default mode means "fall back to task
policy"--which may or may not specify Default mode. Thus, Default
mode can not be counted on to mean local allocation when used
on a non-shared region of the address space. However, see
MPOL_PREFERRED below.
Default Mode--MPOL_DEFAULT: This mode is only used in the memory
policy APIs. Internally, MPOL_DEFAULT is converted to the NULL
memory policy in all policy scopes. Any existing non-default policy
will simply be removed when MPOL_DEFAULT is specified. As a result,
MPOL_DEFAULT means "fall back to the next most specific policy scope."

For example, a NULL or default task policy will fall back to the
system default policy. A NULL or default vma policy will fall
back to the task policy.

When specified in one of the memory policy APIs, the Default mode
does not use the optional set of nodes.

It is an error for the set of nodes specified for this policy to
be non-empty.
@@ -187,19 +170,18 @@ Components of Memory Policies

MPOL_PREFERRED: This mode specifies that the allocation should be
attempted from the single node specified in the policy. If that
allocation fails, the kernel will search other nodes, exactly as
it would for a local allocation that started at the preferred node
in increasing distance from the preferred node. "Local" allocation
policy can be viewed as a Preferred policy that starts at the node
allocation fails, the kernel will search other nodes, in order of
increasing distance from the preferred node based on information
provided by the platform firmware.
containing the cpu where the allocation takes place.

Internally, the Preferred policy uses a single node--the
preferred_node member of struct mempolicy. A "distinguished
value of this preferred_node, currently '-1', is interpreted
as "the node containing the cpu where the allocation takes
place"--local allocation. This is the way to specify
local allocation for a specific range of addresses--i.e. for
VMA policies.
place"--local allocation. "Local" allocation policy can be
viewed as a Preferred policy that starts at the node containing
the cpu where the allocation takes place.

It is possible for the user to specify that local allocation is
always preferred by passing an empty nodemask with this mode.
68 changes: 42 additions & 26 deletions mm/mempolicy.c
@@ -104,9 +104,13 @@ static struct kmem_cache *sn_cache;
policied. */
enum zone_type policy_zone = 0;

/*
* run-time system-wide default policy => local allocation
*/
struct mempolicy default_policy = {
.refcnt = ATOMIC_INIT(1), /* never free it */
.mode = MPOL_DEFAULT,
.mode = MPOL_PREFERRED,
.v = { .preferred_node = -1 },
};

static const struct mempolicy_operations {
@@ -189,7 +193,7 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
if (mode == MPOL_DEFAULT) {
if (nodes && !nodes_empty(*nodes))
return ERR_PTR(-EINVAL);
return NULL;
return NULL; /* simply delete any existing policy */
}
VM_BUG_ON(!nodes);

@@ -246,7 +250,6 @@ void __mpol_put(struct mempolicy *p)
{
if (!atomic_dec_and_test(&p->refcnt))
return;
p->mode = MPOL_DEFAULT;
kmem_cache_free(policy_cache, p);
}

@@ -626,13 +629,16 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
return 0;
}

/* Fill a zone bitmap for a policy */
static void get_zonemask(struct mempolicy *p, nodemask_t *nodes)
/*
* Return nodemask for policy for get_mempolicy() query
*/
static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
{
nodes_clear(*nodes);
if (p == &default_policy)
return;

switch (p->mode) {
case MPOL_DEFAULT:
break;
case MPOL_BIND:
/* Fall through */
case MPOL_INTERLEAVE:
@@ -686,6 +692,11 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
}

if (flags & MPOL_F_ADDR) {
/*
* Do NOT fall back to task policy if the
* vma/shared policy at addr is NULL. We
* want to return MPOL_DEFAULT in this case.
*/
down_read(&mm->mmap_sem);
vma = find_vma_intersection(mm, addr, addr+1);
if (!vma) {
@@ -700,7 +711,7 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
return -EINVAL;

if (!pol)
pol = &default_policy;
pol = &default_policy; /* indicates default behavior */

if (flags & MPOL_F_NODE) {
if (flags & MPOL_F_ADDR) {
@@ -715,8 +726,11 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
err = -EINVAL;
goto out;
}
} else
*policy = pol->mode | pol->flags;
} else {
*policy = pol == &default_policy ? MPOL_DEFAULT :
pol->mode;
*policy |= pol->flags;
}

if (vma) {
up_read(&current->mm->mmap_sem);
@@ -725,7 +739,7 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,

err = 0;
if (nmask)
get_zonemask(pol, nmask);
get_policy_nodemask(pol, nmask);

out:
mpol_cond_put(pol);
@@ -1286,8 +1300,7 @@ static struct mempolicy *get_vma_policy(struct task_struct *task,
addr);
if (vpol)
pol = vpol;
} else if (vma->vm_policy &&
vma->vm_policy->mode != MPOL_DEFAULT)
} else if (vma->vm_policy)
pol = vma->vm_policy;
}
if (!pol)
@@ -1334,7 +1347,6 @@ static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy)
nd = first_node(policy->v.nodes);
break;
case MPOL_INTERLEAVE: /* should not happen */
case MPOL_DEFAULT:
nd = numa_node_id();
break;
default:
@@ -1369,9 +1381,15 @@ static unsigned interleave_nodes(struct mempolicy *policy)
*/
unsigned slab_node(struct mempolicy *policy)
{
unsigned short pol = policy ? policy->mode : MPOL_DEFAULT;
if (!policy)
return numa_node_id();

switch (policy->mode) {
case MPOL_PREFERRED:
if (unlikely(policy->v.preferred_node >= 0))
return policy->v.preferred_node;
return numa_node_id();

switch (pol) {
case MPOL_INTERLEAVE:
return interleave_nodes(policy);

@@ -1390,13 +1408,8 @@ unsigned slab_node(struct mempolicy *policy)
return zone->node;
}

case MPOL_PREFERRED:
if (policy->v.preferred_node >= 0)
return policy->v.preferred_node;
/* Fall through */

default:
return numa_node_id();
BUG();
}
}

@@ -1650,8 +1663,6 @@ int __mpol_equal(struct mempolicy *a, struct mempolicy *b)
if (a->mode != MPOL_DEFAULT && !mpol_match_intent(a, b))
return 0;
switch (a->mode) {
case MPOL_DEFAULT:
return 1;
case MPOL_BIND:
/* Fall through */
case MPOL_INTERLEAVE:
@@ -1828,7 +1839,7 @@ void mpol_shared_policy_init(struct shared_policy *info, unsigned short policy,
if (policy != MPOL_DEFAULT) {
struct mempolicy *newpol;

/* Falls back to MPOL_DEFAULT on any error */
/* Falls back to NULL policy [MPOL_DEFAULT] on any error */
newpol = mpol_new(policy, flags, policy_nodes);
if (!IS_ERR(newpol)) {
/* Create pseudo-vma that contains just the policy */
@@ -1952,9 +1963,14 @@ static inline int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
char *p = buffer;
int l;
nodemask_t nodes;
unsigned short mode = pol ? pol->mode : MPOL_DEFAULT;
unsigned short mode;
unsigned short flags = pol ? pol->flags : 0;

if (!pol || pol == &default_policy)
mode = MPOL_DEFAULT;
else
mode = pol->mode;

switch (mode) {
case MPOL_DEFAULT:
nodes_clear(nodes);