mm, swap: fix race between swap count continuation operations
One page stores a set of sis->swap_map (swap_info_struct->swap_map)
entries that may belong to multiple swap clusters.

If any of those entries has sis->swap_map[offset] > SWAP_MAP_MAX,
additional pages are used to store the overflowing counts for the set of
entries, and those pages are linked together via page->lru.  This is
called swap count continuation.  Concurrent access to the continuation
pages used to be serialized by sis->lock.  But to improve the
scalability of __swap_duplicate(), swap_count_continued() may now take
the swap cluster lock instead.  This can race with
add_swap_count_continuation() operating on a nearby swap cluster whose
sis->swap_map entries are stored in the same page.
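
For example, assuming 4 KiB pages, one byte per swap_map entry, and the
kernel's 256-entry swap clusters (SWAPFILE_CLUSTER), one swap_map page
covers 4096 / 256 = 16 clusters, so up to 16 different cluster locks can
guard entries whose continuation counts share a single page list.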

In practice, the race can corrupt the swap count, leading to unfreeable
swap entries, software lockups, and similar failures.
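
A hypothetical interleaving (offsets A and B lie in different clusters,
but their swap_map entries share one page; the CPU labels are
illustrative):

    CPU0: swap_count_continued(A)      CPU1: add_swap_count_continuation(B)
    -----------------------------      -------------------------------------
    lock cluster containing A          lock cluster containing B (a
                                       different lock)
    walk the pages on head->lru        list_add_tail(&page->lru, &head->lru)

No common lock serializes the two accesses to head->lru, so the walker
can observe the list mid-update and read or update the wrong
continuation count.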

To fix the race, a new spinlock, cont_lock, is added to struct
swap_info_struct to protect the swap count continuation page list.
Because it is a per-swap-device lock, it does not scale particularly
well, but it is still much better than the original sis->lock scheme:
it is only acquired/released when swap count continuation is actually
used, which is rare in practice.  If scalability turns out to be a
problem for some workloads, the lock can be split into more
fine-grained locks later.
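
The resulting locking pattern, as a minimal standalone userspace sketch
(pthread mutexes stand in for kernel spinlocks; NCLUSTERS, struct
cont_page, and add_continuation() are illustrative names, not kernel
code):

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NCLUSTERS 16            /* clusters sharing one swap_map page */
    #define ADDS_PER_THREAD 100000

    struct cont_page {
            struct cont_page *next;
    };

    static pthread_mutex_t cluster_lock[NCLUSTERS];
    static pthread_mutex_t cont_lock;   /* device level, like si->cont_lock */
    static struct cont_page *cont_list; /* shared continuation page list */

    static void add_continuation(int cluster, struct cont_page *page)
    {
            pthread_mutex_lock(&cluster_lock[cluster]);
            /*
             * The cluster lock alone is not enough: a thread holding a
             * *different* cluster_lock[] entry may touch cont_list at
             * the same time, which is exactly the bug being fixed.
             */
            pthread_mutex_lock(&cont_lock);
            page->next = cont_list;         /* mutate the shared list */
            cont_list = page;
            pthread_mutex_unlock(&cont_lock);
            pthread_mutex_unlock(&cluster_lock[cluster]);
    }

    static void *worker(void *arg)
    {
            int cluster = (int)(long)arg;

            for (int i = 0; i < ADDS_PER_THREAD; i++)
                    add_continuation(cluster,
                                     calloc(1, sizeof(struct cont_page)));
            return NULL;
    }

    int main(void)
    {
            pthread_t t0, t1;
            unsigned long n = 0;

            for (int i = 0; i < NCLUSTERS; i++)
                    pthread_mutex_init(&cluster_lock[i], NULL);
            pthread_mutex_init(&cont_lock, NULL);

            /* two "CPUs" on different clusters, one shared list */
            pthread_create(&t0, NULL, worker, (void *)0L);
            pthread_create(&t1, NULL, worker, (void *)1L);
            pthread_join(t0, NULL);
            pthread_join(t1, NULL);

            for (struct cont_page *p = cont_list; p; p = p->next)
                    n++;
            printf("list length %lu, expected %d\n", n, 2 * ADDS_PER_THREAD);
            return 0;
    }

Dropping the cont_lock pair from add_continuation() reintroduces the
bug: two workers holding different cluster locks would update cont_list
concurrently.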

Link: http://lkml.kernel.org/r/[email protected]
Fixes: 235b621 ("mm/swap: add cluster lock")
Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Aaron Lu <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: <[email protected]>	[4.11+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
yhuang-intel authored and torvalds committed Nov 3, 2017
1 parent dd8a67f commit 2628bd6
Showing 2 changed files with 21 additions and 6 deletions.
include/linux/swap.h: 4 additions & 0 deletions
@@ -266,6 +266,10 @@ struct swap_info_struct {
                                         * both locks need hold, hold swap_lock
                                         * first.
                                         */
+       spinlock_t cont_lock;   /*
+                                * protect swap count continuation page
+                                * list.
+                                */
        struct work_struct discard_work; /* discard worker */
        struct swap_cluster_list discard_clusters; /* discard clusters list */
 };
mm/swapfile.c: 17 additions & 6 deletions
@@ -2869,6 +2869,7 @@ static struct swap_info_struct *alloc_swap_info(void)
        p->flags = SWP_USED;
        spin_unlock(&swap_lock);
        spin_lock_init(&p->lock);
+       spin_lock_init(&p->cont_lock);

        return p;
 }
@@ -3545,6 +3546,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
        head = vmalloc_to_page(si->swap_map + offset);
        offset &= ~PAGE_MASK;

+       spin_lock(&si->cont_lock);
        /*
         * Page allocation does not initialize the page's lru field,
         * but it does always reset its private field.
@@ -3564,7 +3566,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
         * a continuation page, free our allocation and use this one.
         */
        if (!(count & COUNT_CONTINUED))
-               goto out;
+               goto out_unlock_cont;

        map = kmap_atomic(list_page) + offset;
        count = *map;
@@ -3575,11 +3577,13 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
                 * free our allocation and use this one.
                 */
                if ((count & ~COUNT_CONTINUED) != SWAP_CONT_MAX)
-                       goto out;
+                       goto out_unlock_cont;
        }

        list_add_tail(&page->lru, &head->lru);
        page = NULL;                    /* now it's attached, don't free it */
+out_unlock_cont:
+       spin_unlock(&si->cont_lock);
 out:
        unlock_cluster(ci);
        spin_unlock(&si->lock);
@@ -3604,13 +3608,15 @@ static bool swap_count_continued(struct swap_info_struct *si,
        struct page *head;
        struct page *page;
        unsigned char *map;
+       bool ret;

        head = vmalloc_to_page(si->swap_map + offset);
        if (page_private(head) != SWP_CONTINUED) {
                BUG_ON(count & COUNT_CONTINUED);
                return false;           /* need to add count continuation */
        }

+       spin_lock(&si->cont_lock);
        offset &= ~PAGE_MASK;
        page = list_entry(head->lru.next, struct page, lru);
        map = kmap_atomic(page) + offset;
@@ -3631,8 +3637,10 @@ static bool swap_count_continued(struct swap_info_struct *si,
                if (*map == SWAP_CONT_MAX) {
                        kunmap_atomic(map);
                        page = list_entry(page->lru.next, struct page, lru);
-                       if (page == head)
-                               return false;   /* add count continuation */
+                       if (page == head) {
+                               ret = false;    /* add count continuation */
+                               goto out;
+                       }
                        map = kmap_atomic(page) + offset;
 init_map:              *map = 0;               /* we didn't zero the page */
                }
@@ -3645,7 +3653,7 @@ init_map: *map = 0; /* we didn't zero the page */
                        kunmap_atomic(map);
                        page = list_entry(page->lru.prev, struct page, lru);
                }
-               return true;                    /* incremented */
+               ret = true;                     /* incremented */

        } else {                                /* decrementing */
                /*
@@ -3671,8 +3679,11 @@ init_map: *map = 0; /* we didn't zero the page */
                        kunmap_atomic(map);
                        page = list_entry(page->lru.prev, struct page, lru);
                }
-               return count == COUNT_CONTINUED;
+               ret = count == COUNT_CONTINUED;
        }
+out:
+       spin_unlock(&si->cont_lock);
+       return ret;
 }

 /*
