License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information in it,
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information.
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging were:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
   SPDX license identifier                             # files
   ---------------------------------------------------|-------
   GPL-2.0                                               11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
   SPDX license identifier                             # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                         930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
   SPDX license identifier                              # files
   ---------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                          270
   GPL-2.0+ WITH Linux-syscall-note                         169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)       21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)       17
   LGPL-2.1+ WITH Linux-syscall-note                         15
   GPL-1.0+ WITH Linux-syscall-note                          14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)       5
   LGPL-2.0+ WITH Linux-syscall-note                          4
   LGPL-2.1 WITH Linux-syscall-note                           3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)                 3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)                1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 15:07:57 +01:00
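The commit message above notes that header files and .c source files need different comment types for the SPDX tag. The following is a minimal userspace sketch of that one rule, not the script Thomas and Greg actually used. The helper name `spdx_line` and the extension-based dispatch are assumptions made for illustration; the two comment forms are the ones the kernel's license rules prescribe for C sources (`//`) and C headers (`/* */`).

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format the SPDX tag line for a file, choosing the
 * comment style from the file extension. Returns the number of characters
 * written (as snprintf() does), or -1 if the extension is not handled.
 */
int spdx_line(const char *path, const char *license, char *out, size_t len)
{
	const char *ext = strrchr(path, '.');

	if (!ext)
		return -1;
	if (strcmp(ext, ".c") == 0)
		return snprintf(out, len, "// SPDX-License-Identifier: %s", license);
	if (strcmp(ext, ".h") == 0)
		return snprintf(out, len, "/* SPDX-License-Identifier: %s */", license);
	return -1;
}
```

For a .c file such as this one, the helper yields the same first line that the blame view shows below the commit.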
// SPDX-License-Identifier: GPL-2.0
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/sched/mm.h>
#include <linux/sched/coredump.h>
#include <linux/mmu_notifier.h>
#include <linux/rmap.h>
#include <linux/swap.h>
#include <linux/mm_inline.h>
#include <linux/kthread.h>
#include <linux/khugepaged.h>
#include <linux/freezer.h>
#include <linux/mman.h>
#include <linux/hashtable.h>
#include <linux/userfaultfd_k.h>
#include <linux/page_idle.h>
#include <linux/swapops.h>
#include <linux/shmem_fs.h>

#include <asm/tlb.h>
#include <asm/pgalloc.h>
#include "internal.h"

/* gross hack for <=4.19 stable */
#if defined(CONFIG_S390) || defined(CONFIG_ARM)
static void tlb_remove_table_smp_sync(void *arg)
{
	/* Simply deliver the interrupt */
}

static void tlb_remove_table_sync_one(void)
{
	smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
}
#endif

enum scan_result {
	SCAN_FAIL,
	SCAN_SUCCEED,
	SCAN_PMD_NULL,
	SCAN_EXCEED_NONE_PTE,
	SCAN_PTE_NON_PRESENT,
	SCAN_PAGE_RO,
	SCAN_LACK_REFERENCED_PAGE,
	SCAN_PAGE_NULL,
	SCAN_SCAN_ABORT,
	SCAN_PAGE_COUNT,
	SCAN_PAGE_LRU,
	SCAN_PAGE_LOCK,
	SCAN_PAGE_ANON,
	SCAN_PAGE_COMPOUND,
	SCAN_ANY_PROCESS,
	SCAN_VMA_NULL,
	SCAN_VMA_CHECK,
	SCAN_ADDRESS_RANGE,
	SCAN_SWAP_CACHE_PAGE,
	SCAN_DEL_PAGE_LRU,
	SCAN_ALLOC_HUGE_PAGE_FAIL,
	SCAN_CGROUP_CHARGE_FAIL,
	SCAN_EXCEED_SWAP_PTE,
	SCAN_TRUNCATED,
};

#define CREATE_TRACE_POINTS
#include <trace/events/huge_memory.h>

static struct task_struct *khugepaged_thread __read_mostly;
static DEFINE_MUTEX(khugepaged_mutex);

/* default scan 8*512 pte (or vmas) every 30 second */
static unsigned int khugepaged_pages_to_scan __read_mostly;
static unsigned int khugepaged_pages_collapsed;
static unsigned int khugepaged_full_scans;
static unsigned int khugepaged_scan_sleep_millisecs __read_mostly = 10000;
/* during fragmentation poll the hugepage allocator once every minute */
static unsigned int khugepaged_alloc_sleep_millisecs __read_mostly = 60000;
static unsigned long khugepaged_sleep_expire;
static DEFINE_SPINLOCK(khugepaged_mm_lock);
static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
/*
 * default collapse hugepages if there is at least one pte mapped like
 * it would have happened if the vma was large enough during page
 * fault.
 */
static unsigned int khugepaged_max_ptes_none __read_mostly;
static unsigned int khugepaged_max_ptes_swap __read_mostly;

#define MM_SLOTS_HASH_BITS 10
static __read_mostly DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);

static struct kmem_cache *mm_slot_cache __read_mostly;

/**
 * struct mm_slot - hash lookup from mm to mm_slot
 * @hash: hash collision list
 * @mm_node: khugepaged scan list headed in khugepaged_scan.mm_head
 * @mm: the mm that this information is valid for
 */
struct mm_slot {
	struct hlist_node hash;
	struct list_head mm_node;
	struct mm_struct *mm;
};

/**
 * struct khugepaged_scan - cursor for scanning
 * @mm_head: the head of the mm list to scan
 * @mm_slot: the current mm_slot we are scanning
 * @address: the next address inside that to be scanned
 *
 * There is only the one khugepaged_scan instance of this cursor structure.
 */
struct khugepaged_scan {
	struct list_head mm_head;
	struct mm_slot *mm_slot;
	unsigned long address;
};

static struct khugepaged_scan khugepaged_scan = {
	.mm_head = LIST_HEAD_INIT(khugepaged_scan.mm_head),
};

#ifdef CONFIG_SYSFS
static ssize_t scan_sleep_millisecs_show(struct kobject *kobj,
					 struct kobj_attribute *attr,
					 char *buf)
{
	return sprintf(buf, "%u\n", khugepaged_scan_sleep_millisecs);
}

static ssize_t scan_sleep_millisecs_store(struct kobject *kobj,
					  struct kobj_attribute *attr,
					  const char *buf, size_t count)
{
	unsigned long msecs;
	int err;

	err = kstrtoul(buf, 10, &msecs);
	if (err || msecs > UINT_MAX)
		return -EINVAL;

	khugepaged_scan_sleep_millisecs = msecs;
	khugepaged_sleep_expire = 0;
	wake_up_interruptible(&khugepaged_wait);

	return count;
}
static struct kobj_attribute scan_sleep_millisecs_attr =
	__ATTR(scan_sleep_millisecs, 0644, scan_sleep_millisecs_show,
	       scan_sleep_millisecs_store);
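
The store handler above parses a decimal string, rejects values that do not fit in an unsigned int, and only then updates the tunable. The following is a minimal userspace sketch of that parse-and-validate step, under stated assumptions: `parse_msecs` is a hypothetical name, and `strtoul()` stands in for the kernel's `kstrtoul()`. The sketch is slightly stricter than `kstrtoul()`, which also accepts a leading plus sign.

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical userspace stand-in for the validation done in the
 * *_millisecs_store() handlers. Accepts a decimal number optionally
 * followed by one newline (what "echo 10000 > file" would write).
 * Returns 0 and stores the value on success, -EINVAL otherwise.
 */
int parse_msecs(const char *buf, unsigned int *out)
{
	char *end;
	unsigned long val;

	/* strtoul() would skip whitespace and accept a sign; refuse both. */
	if (!isdigit((unsigned char)buf[0]))
		return -EINVAL;

	errno = 0;
	val = strtoul(buf, &end, 10);
	if (errno == ERANGE)
		return -EINVAL;
	if (*end == '\n')
		end++;
	/* Trailing garbage or a value too large for the tunable. */
	if (*end != '\0' || val > UINT_MAX)
		return -EINVAL;

	*out = (unsigned int)val;
	return 0;
}
```

In the kernel handler a successful parse is followed by resetting `khugepaged_sleep_expire` and waking `khugepaged_wait`, so a new sleep interval takes effect without waiting out the old one.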

static ssize_t alloc_sleep_millisecs_show(struct kobject *kobj,
					  struct kobj_attribute *attr,
					  char *buf)
{
	return sprintf(buf, "%u\n", khugepaged_alloc_sleep_millisecs);
}

static ssize_t alloc_sleep_millisecs_store(struct kobject *kobj,
					   struct kobj_attribute *attr,
					   const char *buf, size_t count)
{
	unsigned long msecs;
	int err;

	err = kstrtoul(buf, 10, &msecs);
	if (err || msecs > UINT_MAX)
		return -EINVAL;

	khugepaged_alloc_sleep_millisecs = msecs;
	khugepaged_sleep_expire = 0;
	wake_up_interruptible(&khugepaged_wait);

	return count;
}
static struct kobj_attribute alloc_sleep_millisecs_attr =
	__ATTR(alloc_sleep_millisecs, 0644, alloc_sleep_millisecs_show,
	       alloc_sleep_millisecs_store);

static ssize_t pages_to_scan_show(struct kobject *kobj,
				  struct kobj_attribute *attr,
				  char *buf)
{
	return sprintf(buf, "%u\n", khugepaged_pages_to_scan);
}
static ssize_t pages_to_scan_store(struct kobject *kobj,
				   struct kobj_attribute *attr,
				   const char *buf, size_t count)
{
	int err;
	unsigned long pages;

	err = kstrtoul(buf, 10, &pages);
	if (err || !pages || pages > UINT_MAX)
		return -EINVAL;

	khugepaged_pages_to_scan = pages;

	return count;
}
static struct kobj_attribute pages_to_scan_attr =
	__ATTR(pages_to_scan, 0644, pages_to_scan_show,
	       pages_to_scan_store);

static ssize_t pages_collapsed_show(struct kobject *kobj,
				    struct kobj_attribute *attr,
				    char *buf)
{
	return sprintf(buf, "%u\n", khugepaged_pages_collapsed);
}
static struct kobj_attribute pages_collapsed_attr =
	__ATTR_RO(pages_collapsed);

static ssize_t full_scans_show(struct kobject *kobj,
			       struct kobj_attribute *attr,
			       char *buf)
{
	return sprintf(buf, "%u\n", khugepaged_full_scans);
}
static struct kobj_attribute full_scans_attr =
	__ATTR_RO(full_scans);

static ssize_t khugepaged_defrag_show(struct kobject *kobj,
				      struct kobj_attribute *attr, char *buf)
{
	return single_hugepage_flag_show(kobj, attr, buf,
				TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG);
}
static ssize_t khugepaged_defrag_store(struct kobject *kobj,
				       struct kobj_attribute *attr,
				       const char *buf, size_t count)
{
	return single_hugepage_flag_store(kobj, attr, buf, count,
				 TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG);
}
static struct kobj_attribute khugepaged_defrag_attr =
	__ATTR(defrag, 0644, khugepaged_defrag_show,
	       khugepaged_defrag_store);

/*
 * max_ptes_none controls if khugepaged should collapse hugepages over
 * any unmapped ptes in turn potentially increasing the memory
 * footprint of the vmas. When max_ptes_none is 0 khugepaged will not
 * reduce the available free memory in the system as it
 * runs. Increasing max_ptes_none will instead potentially reduce the
 * free memory in the system during the khugepaged scan.
 */
static ssize_t khugepaged_max_ptes_none_show(struct kobject *kobj,
					     struct kobj_attribute *attr,
					     char *buf)
{
	return sprintf(buf, "%u\n", khugepaged_max_ptes_none);
}
static ssize_t khugepaged_max_ptes_none_store(struct kobject *kobj,
					      struct kobj_attribute *attr,
					      const char *buf, size_t count)
{
	int err;
	unsigned long max_ptes_none;

	err = kstrtoul(buf, 10, &max_ptes_none);
	if (err || max_ptes_none > HPAGE_PMD_NR-1)
		return -EINVAL;

	khugepaged_max_ptes_none = max_ptes_none;

	return count;
}
static struct kobj_attribute khugepaged_max_ptes_none_attr =
	__ATTR(max_ptes_none, 0644, khugepaged_max_ptes_none_show,
	       khugepaged_max_ptes_none_store);

static ssize_t khugepaged_max_ptes_swap_show(struct kobject *kobj,
					     struct kobj_attribute *attr,
					     char *buf)
{
	return sprintf(buf, "%u\n", khugepaged_max_ptes_swap);
}

static ssize_t khugepaged_max_ptes_swap_store(struct kobject *kobj,
					      struct kobj_attribute *attr,
					      const char *buf, size_t count)
{
	int err;
	unsigned long max_ptes_swap;

	err = kstrtoul(buf, 10, &max_ptes_swap);
	if (err || max_ptes_swap > HPAGE_PMD_NR-1)
		return -EINVAL;

	khugepaged_max_ptes_swap = max_ptes_swap;

	return count;
}

static struct kobj_attribute khugepaged_max_ptes_swap_attr =
	__ATTR(max_ptes_swap, 0644, khugepaged_max_ptes_swap_show,
	       khugepaged_max_ptes_swap_store);

static struct attribute *khugepaged_attr[] = {
	&khugepaged_defrag_attr.attr,
	&khugepaged_max_ptes_none_attr.attr,
	&pages_to_scan_attr.attr,
	&pages_collapsed_attr.attr,
	&full_scans_attr.attr,
	&scan_sleep_millisecs_attr.attr,
	&alloc_sleep_millisecs_attr.attr,
	&khugepaged_max_ptes_swap_attr.attr,
	NULL,
};

struct attribute_group khugepaged_attr_group = {
	.attrs = khugepaged_attr,
	.name = "khugepaged",
};
#endif /* CONFIG_SYSFS */

#define VM_NO_KHUGEPAGED (VM_SPECIAL | VM_HUGETLB)

int hugepage_madvise(struct vm_area_struct *vma,
		     unsigned long *vm_flags, int advice)
{
	switch (advice) {
	case MADV_HUGEPAGE:
#ifdef CONFIG_S390
		/*
		 * qemu blindly sets MADV_HUGEPAGE on all allocations, but s390
		 * can't handle this properly after s390_enable_sie, so we simply
		 * ignore the madvise to prevent qemu from causing a SIGSEGV.
		 */
		if (mm_has_pgste(vma->vm_mm))
			return 0;
#endif
		*vm_flags &= ~VM_NOHUGEPAGE;
		*vm_flags |= VM_HUGEPAGE;
		/*
		 * If the vma become good for khugepaged to scan,
		 * register it here without waiting a page fault that
		 * may not happen any time soon.
		 */
		if (!(*vm_flags & VM_NO_KHUGEPAGED) &&
				khugepaged_enter_vma_merge(vma, *vm_flags))
			return -ENOMEM;
		break;
	case MADV_NOHUGEPAGE:
		*vm_flags &= ~VM_HUGEPAGE;
		*vm_flags |= VM_NOHUGEPAGE;
		/*
		 * Setting VM_NOHUGEPAGE will prevent khugepaged from scanning
		 * this vma even if we leave the mm registered in khugepaged if
		 * it got registered before VM_NOHUGEPAGE was set.
		 */
		break;
	}

	return 0;
}

int __init khugepaged_init(void)
{
	mm_slot_cache = kmem_cache_create("khugepaged_mm_slot",
					  sizeof(struct mm_slot),
					  __alignof__(struct mm_slot), 0, NULL);
	if (!mm_slot_cache)
		return -ENOMEM;

	khugepaged_pages_to_scan = HPAGE_PMD_NR * 8;
	khugepaged_max_ptes_none = HPAGE_PMD_NR - 1;
	khugepaged_max_ptes_swap = HPAGE_PMD_NR / 8;

	return 0;
}
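
The three defaults set in khugepaged_init() are all derived from HPAGE_PMD_NR, the number of base pages in one PMD-sized huge page. The following worked example is a sketch under an assumed configuration (4 KiB base pages and 2 MiB huge pages, giving HPAGE_PMD_NR = 512); the shift values, the struct, and the `compute_defaults` helper are illustrative and not part of the kernel source.

```c
/* Assumed configuration: 4 KiB base pages, 2 MiB PMD-sized huge pages. */
#define PAGE_SHIFT	12
#define HPAGE_PMD_SHIFT	21
#define HPAGE_PMD_NR	(1U << (HPAGE_PMD_SHIFT - PAGE_SHIFT))

/* Hypothetical container for the three tunables initialised at boot. */
struct khugepaged_defaults {
	unsigned int pages_to_scan;
	unsigned int max_ptes_none;
	unsigned int max_ptes_swap;
};

/* Mirrors the three assignments made in khugepaged_init(). */
struct khugepaged_defaults compute_defaults(void)
{
	struct khugepaged_defaults d = {
		.pages_to_scan = HPAGE_PMD_NR * 8,	/* 512 * 8 */
		.max_ptes_none = HPAGE_PMD_NR - 1,	/* 512 - 1 */
		.max_ptes_swap = HPAGE_PMD_NR / 8,	/* 512 / 8 */
	};

	return d;
}
```

Under these assumptions the defaults come out as 4096 pages per scan pass, up to 511 empty PTEs tolerated per collapse, and up to 64 swapped-out PTEs. Other page-size configurations would give different numbers.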

void __init khugepaged_destroy(void)
{
	kmem_cache_destroy(mm_slot_cache);
}

static inline struct mm_slot *alloc_mm_slot(void)
{
	if (!mm_slot_cache)	/* initialization failed */
		return NULL;
	return kmem_cache_zalloc(mm_slot_cache, GFP_KERNEL);
}

static inline void free_mm_slot(struct mm_slot *mm_slot)
{
	kmem_cache_free(mm_slot_cache, mm_slot);
}

static struct mm_slot *get_mm_slot(struct mm_struct *mm)
{
	struct mm_slot *mm_slot;

	hash_for_each_possible(mm_slots_hash, mm_slot, hash, (unsigned long)mm)
		if (mm == mm_slot->mm)
			return mm_slot;

	return NULL;
}

static void insert_to_mm_slots_hash(struct mm_struct *mm,
				    struct mm_slot *mm_slot)
{
	mm_slot->mm = mm;
	hash_add(mm_slots_hash, &mm_slot->hash, (long)mm);
}

static inline int khugepaged_test_exit(struct mm_struct *mm)
{
	return atomic_read(&mm->mm_users) == 0 || !mmget_still_valid(mm);
}

int __khugepaged_enter(struct mm_struct *mm)
{
	struct mm_slot *mm_slot;
	int wakeup;

	mm_slot = alloc_mm_slot();
	if (!mm_slot)
		return -ENOMEM;

	/* __khugepaged_exit() must not run from under us */
	VM_BUG_ON_MM(atomic_read(&mm->mm_users) == 0, mm);
	if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))) {
		free_mm_slot(mm_slot);
		return 0;
	}

	spin_lock(&khugepaged_mm_lock);
	insert_to_mm_slots_hash(mm, mm_slot);
	/*
	 * Insert just behind the scanning cursor, to let the area settle
	 * down a little.
	 */
	wakeup = list_empty(&khugepaged_scan.mm_head);
	list_add_tail(&mm_slot->mm_node, &khugepaged_scan.mm_head);
	spin_unlock(&khugepaged_mm_lock);

	mmgrab(mm);
	if (wakeup)
		wake_up_interruptible(&khugepaged_wait);

	return 0;
}

int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
			       unsigned long vm_flags)
{
	unsigned long hstart, hend;
	if (!vma->anon_vma)
		/*
		 * Not yet faulted in so we will register later in the
		 * page fault if needed.
		 */
		return 0;
	if (vma->vm_ops || (vm_flags & VM_NO_KHUGEPAGED))
		/* khugepaged not yet working on file or special mappings */
		return 0;
	hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
	hend = vma->vm_end & HPAGE_PMD_MASK;
	if (hstart < hend)
		return khugepaged_enter(vma, vm_flags);
	return 0;
}
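
The hstart/hend computation above rounds the start of the vma up and the end down to huge-page boundaries, so `hstart < hend` holds only when the vma contains at least one fully aligned huge-page-sized range. The following is a minimal userspace sketch of that arithmetic, assuming 2 MiB PMD-sized huge pages; the macro values and the function name `range_fits_hugepage` are assumptions for illustration.

```c
#include <stdbool.h>

/* Assumed: 2 MiB PMD-sized huge pages (shift of 21). */
#define HPAGE_PMD_SIZE	(1UL << 21)
#define HPAGE_PMD_MASK	(~(HPAGE_PMD_SIZE - 1))

/*
 * Returns true if [vm_start, vm_end) contains at least one naturally
 * aligned huge-page-sized range, using the same rounding as
 * khugepaged_enter_vma_merge().
 */
bool range_fits_hugepage(unsigned long vm_start, unsigned long vm_end)
{
	/* Round the start up to the next huge-page boundary. */
	unsigned long hstart = (vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
	/* Round the end down to the previous huge-page boundary. */
	unsigned long hend = vm_end & HPAGE_PMD_MASK;

	return hstart < hend;
}
```

For instance, a range starting one base page past a 2 MiB boundary and ending at the next boundary does not qualify, because rounding the start up makes hstart equal to hend.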

void __khugepaged_exit(struct mm_struct *mm)
{
	struct mm_slot *mm_slot;
	int free = 0;

	spin_lock(&khugepaged_mm_lock);
	mm_slot = get_mm_slot(mm);
	if (mm_slot && khugepaged_scan.mm_slot != mm_slot) {
		hash_del(&mm_slot->hash);
		list_del(&mm_slot->mm_node);
		free = 1;
	}
	spin_unlock(&khugepaged_mm_lock);

	if (free) {
		clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
		free_mm_slot(mm_slot);
		mmdrop(mm);
	} else if (mm_slot) {
		/*
		 * This is required to serialize against
		 * khugepaged_test_exit() (which is guaranteed to run
		 * under mmap sem read mode). Stop here (after we
		 * return all pagetables will be destroyed) until
		 * khugepaged has finished working on the pagetables
		 * under the mmap_sem.
		 */
		down_write(&mm->mmap_sem);
		up_write(&mm->mmap_sem);
	}
}

static void release_pte_page(struct page *page)
{
	dec_node_page_state(page, NR_ISOLATED_ANON + page_is_file_cache(page));
	unlock_page(page);
	putback_lru_page(page);
}

static void release_pte_pages(pte_t *pte, pte_t *_pte)
{
	while (--_pte >= pte) {
		pte_t pteval = *_pte;
		if (!pte_none(pteval) && !is_zero_pfn(pte_pfn(pteval)))
			release_pte_page(pte_page(pteval));
	}
}

static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
					unsigned long address,
					pte_t *pte)
{
	struct page *page = NULL;
	pte_t *_pte;
	int none_or_zero = 0, result = 0, referenced = 0;
	bool writable = false;

	for (_pte = pte; _pte < pte+HPAGE_PMD_NR;
	     _pte++, address += PAGE_SIZE) {
		pte_t pteval = *_pte;
		if (pte_none(pteval) || (pte_present(pteval) &&
				is_zero_pfn(pte_pfn(pteval)))) {
			if (!userfaultfd_armed(vma) &&
			    ++none_or_zero <= khugepaged_max_ptes_none) {
				continue;
			} else {
				result = SCAN_EXCEED_NONE_PTE;
				goto out;
			}
		}
		if (!pte_present(pteval)) {
			result = SCAN_PTE_NON_PRESENT;
			goto out;
		}
		page = vm_normal_page(vma, address, pteval);
		if (unlikely(!page)) {
			result = SCAN_PAGE_NULL;
			goto out;
		}

		/* TODO: teach khugepaged to collapse THP mapped with pte */
		if (PageCompound(page)) {
			result = SCAN_PAGE_COMPOUND;
			goto out;
		}

		VM_BUG_ON_PAGE(!PageAnon(page), page);

		/*
		 * We can do it before isolate_lru_page because the
		 * page can't be freed from under us. NOTE: PG_lock
		 * is needed to serialize against split_huge_page
		 * when invoked from the VM.
		 */
		if (!trylock_page(page)) {
			result = SCAN_PAGE_LOCK;
			goto out;
		}

		/*
		 * cannot use mapcount: can't collapse if there's a gup pin.
		 * The page must only be referenced by the scanned process
		 * and page swap cache.
		 */
		if (page_count(page) != 1 + PageSwapCache(page)) {
			unlock_page(page);
			result = SCAN_PAGE_COUNT;
			goto out;
		}
		if (pte_write(pteval)) {
			writable = true;
		} else {
			if (PageSwapCache(page) &&
			    !reuse_swap_page(page, NULL)) {
				unlock_page(page);
				result = SCAN_SWAP_CACHE_PAGE;
				goto out;
			}
			/*
			 * Page is not in the swap cache. It can be collapsed
			 * into a THP.
			 */
		}

		/*
		 * Isolate the page to avoid collapsing an hugepage
		 * currently in use by the VM.
		 */
		if (isolate_lru_page(page)) {
			unlock_page(page);
			result = SCAN_DEL_PAGE_LRU;
			goto out;
		}
		inc_node_page_state(page,
				NR_ISOLATED_ANON + page_is_file_cache(page));
		VM_BUG_ON_PAGE(!PageLocked(page), page);
		VM_BUG_ON_PAGE(PageLRU(page), page);

		/* There should be enough young pte to collapse the page */
		if (pte_young(pteval) ||
		    page_is_young(page) || PageReferenced(page) ||
		    mmu_notifier_test_young(vma->vm_mm, address))
			referenced++;
	}

	if (unlikely(!writable)) {
		result = SCAN_PAGE_RO;
	} else if (unlikely(!referenced)) {
		result = SCAN_LACK_REFERENCED_PAGE;
	} else {
		result = SCAN_SUCCEED;
		trace_mm_collapse_huge_page_isolate(page, none_or_zero,
						    referenced, writable, result);
		return 1;
	}
out:
	release_pte_pages(pte, _pte);
	trace_mm_collapse_huge_page_isolate(page, none_or_zero,
					    referenced, writable, result);
	return 0;
}

static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
				      struct vm_area_struct *vma,
				      unsigned long address,
				      spinlock_t *ptl)
{
	pte_t *_pte;
	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
				_pte++, page++, address += PAGE_SIZE) {
		pte_t pteval = *_pte;
		struct page *src_page;

		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
			clear_user_highpage(page, address);
			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
			if (is_zero_pfn(pte_pfn(pteval))) {
				/*
				 * ptl mostly unnecessary.
				 */
				spin_lock(ptl);
				/*
				 * paravirt calls inside pte_clear here are
				 * superfluous.
				 */
				pte_clear(vma->vm_mm, address, _pte);
				spin_unlock(ptl);
			}
		} else {
			src_page = pte_page(pteval);
			copy_user_highpage(page, src_page, address, vma);
			VM_BUG_ON_PAGE(page_mapcount(src_page) != 1, src_page);
			release_pte_page(src_page);
			/*
			 * ptl mostly unnecessary, but preempt has to
			 * be disabled to update the per-cpu stats
			 * inside page_remove_rmap().
			 */
			spin_lock(ptl);
			/*
			 * paravirt calls inside pte_clear here are
			 * superfluous.
			 */
			pte_clear(vma->vm_mm, address, _pte);
			page_remove_rmap(src_page, false);
			spin_unlock(ptl);
			free_page_and_swap_cache(src_page);
		}
	}
}

static void khugepaged_alloc_sleep(void)
{
	DEFINE_WAIT(wait);

	add_wait_queue(&khugepaged_wait, &wait);
	freezable_schedule_timeout_interruptible(
		msecs_to_jiffies(khugepaged_alloc_sleep_millisecs));
	remove_wait_queue(&khugepaged_wait, &wait);
}

static int khugepaged_node_load[MAX_NUMNODES];

static bool khugepaged_scan_abort(int nid)
{
	int i;

	/*
	 * If node_reclaim_mode is disabled, then no extra effort is made to
	 * allocate memory locally.
	 */
	if (!node_reclaim_mode)
		return false;

	/* If there is a count for this node already, it must be acceptable */
	if (khugepaged_node_load[nid])
		return false;

	for (i = 0; i < MAX_NUMNODES; i++) {
		if (!khugepaged_node_load[i])
			continue;
		if (node_distance(nid, i) > RECLAIM_DISTANCE)
			return true;
	}
	return false;
}
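
khugepaged_scan_abort() gives up on a scan when the pages seen so far sit on NUMA nodes that are too far apart, but only when node reclaim is enabled. The following is a userspace sketch of that decision under stated assumptions: the four-node distance table, the `MAX_NODES` limit, the threshold value, and the function name `scan_should_abort` are hypothetical. The per-node load array and the reclaim switch are passed in as parameters here rather than read from globals as in the kernel.

```c
#include <stdbool.h>

#define MAX_NODES		4	/* hypothetical small NUMA system */
#define RECLAIM_DISTANCE	30	/* threshold assumed for this sketch */

/* Hypothetical distance table: nodes 0-1 and 2-3 form two close pairs. */
static const int distance[MAX_NODES][MAX_NODES] = {
	{ 10, 20, 40, 40 },
	{ 20, 10, 40, 40 },
	{ 40, 40, 10, 20 },
	{ 40, 40, 20, 10 },
};

/*
 * node_load[i] counts pages already seen on node i during this scan.
 * Returns true if a page on node nid is farther than RECLAIM_DISTANCE
 * from any node that already contributed pages.
 */
bool scan_should_abort(const int node_load[MAX_NODES], int nid,
		       bool node_reclaim_enabled)
{
	int i;

	/* Without node reclaim, locality is not enforced. */
	if (!node_reclaim_enabled)
		return false;

	/* A node that already contributed pages was accepted earlier. */
	if (node_load[nid])
		return false;

	for (i = 0; i < MAX_NODES; i++) {
		if (!node_load[i])
			continue;
		if (distance[nid][i] > RECLAIM_DISTANCE)
			return true;
	}
	return false;
}
```

With pages already counted on node 0, a page on node 1 (distance 20) is accepted, while a page on node 2 (distance 40) aborts the scan unless node reclaim is disabled.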

/* Defrag for khugepaged will enter direct reclaim/compaction if necessary */
static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
{
mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations
After the previous patch, we can distinguish costly allocations that
should be really lightweight, such as THP page faults, with
__GFP_NORETRY. This means we don't need to recognize khugepaged
allocations via PF_KTHREAD anymore. We can also change THP page faults
in areas where madvise(MADV_HUGEPAGE) was used to try as hard as
khugepaged, as the process has indicated that it benefits from THP's and
is willing to pay some initial latency costs.
We can also make the flags handling less cryptic by distinguishing
GFP_TRANSHUGE_LIGHT (no reclaim at all, default mode in page fault) from
GFP_TRANSHUGE (only direct reclaim, khugepaged default). Adding
__GFP_NORETRY or __GFP_KSWAPD_RECLAIM is done where needed.
The patch effectively changes the current GFP_TRANSHUGE users as
follows:
* get_huge_zero_page() - the zero page lifetime should be relatively
long and it's shared by multiple users, so it's worth spending some
effort on it. We use GFP_TRANSHUGE, and __GFP_NORETRY is not added.
This also restores direct reclaim to this allocation, which was
unintentionally removed by commit e4a49efe4e7e ("mm: thp: set THP defrag
by default to madvise and add a stall-free defrag option")
* alloc_hugepage_khugepaged_gfpmask() - this is khugepaged, so latency
is not an issue. So if khugepaged "defrag" is enabled (the default), do
reclaim via GFP_TRANSHUGE without __GFP_NORETRY. We can remove the
PF_KTHREAD check from page alloc.
As a side-effect, khugepaged will now no longer check if the initial
compaction was deferred or contended. This is OK, as khugepaged sleep
times between collapsion attempts are long enough to prevent noticeable
disruption, so we should allow it to spend some effort.
* migrate_misplaced_transhuge_page() - already was masking out
__GFP_RECLAIM, so just convert to GFP_TRANSHUGE_LIGHT which is
equivalent.
* alloc_hugepage_direct_gfpmask() - vma's with VM_HUGEPAGE (via madvise)
are now allocating without __GFP_NORETRY. Other vma's keep using
__GFP_NORETRY if direct reclaim/compaction is at all allowed (by default
it's allowed only for madvised vma's). The rest is conversion to
GFP_TRANSHUGE(_LIGHT).
[mhocko@suse.com: suggested GFP_TRANSHUGE_LIGHT]
Link: http://lkml.kernel.org/r/20160721073614.24395-7-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 15:49:25 -07:00
	return khugepaged_defrag() ? GFP_TRANSHUGE : GFP_TRANSHUGE_LIGHT;
}

#ifdef CONFIG_NUMA
static int khugepaged_find_target_node(void)
{
	static int last_khugepaged_target_node = NUMA_NO_NODE;
	int nid, target_node = 0, max_value = 0;

	/* find first node with max normal pages hit */
	for (nid = 0; nid < MAX_NUMNODES; nid++)
		if (khugepaged_node_load[nid] > max_value) {
			max_value = khugepaged_node_load[nid];
			target_node = nid;
		}

	/* do some balance if several nodes have the same hit record */
	if (target_node <= last_khugepaged_target_node)
		for (nid = last_khugepaged_target_node + 1; nid < MAX_NUMNODES;
				nid++)
			if (max_value == khugepaged_node_load[nid]) {
				target_node = nid;
				break;
			}

	last_khugepaged_target_node = target_node;
	return target_node;
}
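The tie-balancing step in khugepaged_find_target_node() is easier to follow with concrete numbers. Below is a user-space sketch under stated assumptions: NR_NODES and find_target_node() are hypothetical names, and the caller-supplied *last replaces the function-local static, starting at -1 as a stand-in for NUMA_NO_NODE.

```c
#include <assert.h>

#define NR_NODES 4	/* hypothetical node count for the sketch */

/*
 * Pick the node with the highest hit count; when several nodes tie, rotate
 * past the node chosen last time. *last plays the role of the function-local
 * static in the kernel code and should start at -1.
 */
static int find_target_node(const int load[NR_NODES], int *last)
{
	int nid, target = 0, max_value = 0;

	assert(last);

	/* First node holding the maximum count. */
	for (nid = 0; nid < NR_NODES; nid++)
		if (load[nid] > max_value) {
			max_value = load[nid];
			target = nid;
		}

	/* On a tie, prefer the next tied node after the previous choice. */
	if (target <= *last)
		for (nid = *last + 1; nid < NR_NODES; nid++)
			if (load[nid] == max_value) {
				target = nid;
				break;
			}

	*last = target;
	return target;
}
```

With loads { 3, 0, 3, 3 }, repeated calls cycle through nodes 0, 2, 3 and then wrap back to 0, so the tied nodes share the allocations.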

static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
{
	if (IS_ERR(*hpage)) {
		if (!*wait)
			return false;

		*wait = false;
		*hpage = NULL;
		khugepaged_alloc_sleep();
	} else if (*hpage) {
		put_page(*hpage);
		*hpage = NULL;
	}

	return true;
}

static struct page *
khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
{
	VM_BUG_ON_PAGE(*hpage, *hpage);

	*hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
	if (unlikely(!*hpage)) {
		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
		*hpage = ERR_PTR(-ENOMEM);
		return NULL;
	}

	prep_transhuge_page(*hpage);
	count_vm_event(THP_COLLAPSE_ALLOC);
	return *hpage;
}
#else
static int khugepaged_find_target_node(void)
{
	return 0;
}

static inline struct page *alloc_khugepaged_hugepage(void)
{
	struct page *page;

	page = alloc_pages(alloc_hugepage_khugepaged_gfpmask(),
			   HPAGE_PMD_ORDER);
	if (page)
		prep_transhuge_page(page);
	return page;
}

static struct page *khugepaged_alloc_hugepage(bool *wait)
{
	struct page *hpage;

	do {
		hpage = alloc_khugepaged_hugepage();
		if (!hpage) {
			count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
			if (!*wait)
				return NULL;

			*wait = false;
			khugepaged_alloc_sleep();
		} else
			count_vm_event(THP_COLLAPSE_ALLOC);
	} while (unlikely(!hpage) && likely(khugepaged_enabled()));

	return hpage;
}

static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
{
mm/khugepaged: fix filemap page_to_pgoff(page) != offset
commit 033b5d77551167f8c24ca862ce83d3e0745f9245 upstream.
There have been elusive reports of filemap_fault() hitting its
VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page) on kernels built
with CONFIG_READ_ONLY_THP_FOR_FS=y.
Suren has hit it on a kernel with CONFIG_READ_ONLY_THP_FOR_FS=y and
CONFIG_NUMA is not set: and he has analyzed it down to how khugepaged
without NUMA reuses the same huge page after collapse_file() failed
(whereas NUMA targets its allocation to the respective node each time).
And most of us were usually testing with CONFIG_NUMA=y kernels.
collapse_file(old start)
  new_page = khugepaged_alloc_page(hpage)
  __SetPageLocked(new_page)
  new_page->index = start // hpage->index=old offset
  new_page->mapping = mapping
  xas_store(&xas, new_page)

                          filemap_fault
                            page = find_get_page(mapping, offset)
                            // if offset falls inside hpage then
                            // compound_head(page) == hpage
                            lock_page_maybe_drop_mmap()
                              __lock_page(page)

  // collapse fails
  xas_store(&xas, old page)
  new_page->mapping = NULL
  unlock_page(new_page)

collapse_file(new start)
  new_page = khugepaged_alloc_page(hpage)
  __SetPageLocked(new_page)
  new_page->index = start // hpage->index=new offset
  new_page->mapping = mapping // mapping becomes valid again

                            // since compound_head(page) == hpage
                            // page_to_pgoff(page) got changed
                            VM_BUG_ON_PAGE(page_to_pgoff(page) != offset)
An initial patch replaced __SetPageLocked() by lock_page(), which did
fix the race which Suren illustrates above. But testing showed that it's
not good enough: if the racing task's __lock_page() gets delayed long
after its find_get_page(), then it may follow collapse_file(new start)'s
successful final unlock_page(), and crash on the same VM_BUG_ON_PAGE.
It could be fixed by relaxing filemap_fault()'s VM_BUG_ON_PAGE to a
check and retry (as is done for mapping), with similar relaxations in
find_lock_entry() and pagecache_get_page(): but it's not obvious what
else might get caught out; and khugepaged non-NUMA appears to be unique
in exposing a page to page cache, then revoking, without going through
a full cycle of freeing before reuse.
Instead, non-NUMA khugepaged_prealloc_page() release the old page
if anyone else has a reference to it (1% of cases when I tested).
Although never reported on huge tmpfs, I believe its find_lock_entry()
has been at similar risk; but huge tmpfs does not rely on khugepaged
for its normal working nearly so much as READ_ONLY_THP_FOR_FS does.
Reported-by: Denis Lisov <dennis.lissov@gmail.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=206569
Link: https://lore.kernel.org/linux-mm/?q=20200219144635.3b7417145de19b65f258c943%40linux-foundation.org
Reported-by: Qian Cai <cai@lca.pw>
Link: https://lore.kernel.org/linux-xfs/?q=20200616013309.GB815%40lca.pw
Reported-and-analyzed-by: Suren Baghdasaryan <surenb@google.com>
Fixes: 87c460a0bded ("mm/khugepaged: collapse_shmem() without freezing new_page")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org # v4.9+
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-09 20:07:59 -07:00
	/*
	 * If the hpage allocated earlier was briefly exposed in page cache
	 * before collapse_file() failed, it is possible that racing lookups
	 * have not yet completed, and would then be unpleasantly surprised by
	 * finding the hpage reused for the same mapping at a different offset.
	 * Just release the previous allocation if there is any danger of that.
	 */
	if (*hpage && page_count(*hpage) > 1) {
		put_page(*hpage);
		*hpage = NULL;
	}

	if (!*hpage)
		*hpage = khugepaged_alloc_hugepage(wait);

	if (unlikely(!*hpage))
		return false;

	return true;
}

static struct page *
khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
{
	VM_BUG_ON(!*hpage);

	return *hpage;
}
#endif

static bool hugepage_vma_check(struct vm_area_struct *vma)
{
	if ((!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always()) ||
	    (vma->vm_flags & VM_NOHUGEPAGE) ||
	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
		return false;
	if (shmem_file(vma->vm_file)) {
		if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE))
			return false;
		return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
				HPAGE_PMD_NR);
	}
	if (!vma->anon_vma || vma->vm_ops)
		return false;
	if (is_vma_temporary_stack(vma))
		return false;
	return !(vma->vm_flags & VM_NO_KHUGEPAGED);
}
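The shmem branch of hugepage_vma_check() only accepts a VMA whose virtual page number and file offset differ by a multiple of the huge page size in pages. A small user-space sketch of that arithmetic follows; SK_PAGE_SHIFT, SK_HPAGE_PMD_NR and shmem_vma_aligned() are hypothetical names, and the 4 KiB / 2 MiB sizes are assumptions that match common x86-64 configurations rather than anything stated here.

```c
#include <assert.h>
#include <stdbool.h>

#define SK_PAGE_SHIFT   12UL	/* assumed 4 KiB base pages */
#define SK_HPAGE_PMD_NR 512UL	/* assumed 2 MiB huge page = 512 base pages */

/*
 * True when (virtual page number - file page offset) is a multiple of the
 * number of base pages in a huge page, so that huge-page-aligned file
 * offsets land on huge-page-aligned virtual addresses.
 */
static bool shmem_vma_aligned(unsigned long vm_start, unsigned long vm_pgoff)
{
	unsigned long delta = (vm_start >> SK_PAGE_SHIFT) - vm_pgoff;

	/* The mask test below relies on a power-of-two page count. */
	assert((SK_HPAGE_PMD_NR & (SK_HPAGE_PMD_NR - 1)) == 0);

	return (delta & (SK_HPAGE_PMD_NR - 1)) == 0;
}
```

A mapping at 0x401000 with file offset 1 passes because the one-page shift in the address is matched by the one-page shift in the file; the same address with offset 0 does not.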

/*
 * If mmap_sem temporarily dropped, revalidate vma
 * before taking mmap_sem.
 * Return 0 if succeeds, otherwise return non-zero
 * value (scan code).
 */

static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
		struct vm_area_struct **vmap)
{
	struct vm_area_struct *vma;
	unsigned long hstart, hend;

	if (unlikely(khugepaged_test_exit(mm)))
		return SCAN_ANY_PROCESS;

	*vmap = vma = find_vma(mm, address);
	if (!vma)
		return SCAN_VMA_NULL;

	hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
	hend = vma->vm_end & HPAGE_PMD_MASK;
	if (address < hstart || address + HPAGE_PMD_SIZE > hend)
		return SCAN_ADDRESS_RANGE;
	if (!hugepage_vma_check(vma))
		return SCAN_VMA_CHECK;
	return 0;
}
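The hstart/hend computation in hugepage_vma_revalidate() rounds the VMA start up and the VMA end down to huge page boundaries, then checks that a whole huge page starting at address fits between them. A user-space sketch, assuming a 2 MiB huge page; SK_HPAGE_PMD_SIZE, SK_HPAGE_PMD_MASK and range_fits() are illustrative names only.

```c
#include <assert.h>
#include <stdbool.h>

#define SK_HPAGE_PMD_SIZE (1UL << 21)	/* assumed 2 MiB huge page */
#define SK_HPAGE_PMD_MASK (~(SK_HPAGE_PMD_SIZE - 1))

/*
 * True when the huge page starting at address lies entirely inside the
 * huge-page-aligned interior [hstart, hend) of the VMA.
 */
static bool range_fits(unsigned long vm_start, unsigned long vm_end,
		       unsigned long address)
{
	/* Round the start up and the end down to a huge page boundary. */
	unsigned long hstart = (vm_start + ~SK_HPAGE_PMD_MASK) & SK_HPAGE_PMD_MASK;
	unsigned long hend = vm_end & SK_HPAGE_PMD_MASK;

	assert(vm_start < vm_end);

	return address >= hstart && address + SK_HPAGE_PMD_SIZE <= hend;
}
```

For a VMA spanning 0x201000 to 0x800000, hstart becomes 0x400000 and hend stays 0x800000, so only the huge pages at 0x400000 and 0x600000 qualify.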

/*
 * Bring missing pages in from swap, to complete THP collapse.
 * Only done if khugepaged_scan_pmd believes it is worthwhile.
 *
 * Called and returns without pte mapped or spinlocks held,
 * but with mmap_sem held to protect against vma changes.
 */

static bool __collapse_huge_page_swapin(struct mm_struct *mm,
					struct vm_area_struct *vma,
					unsigned long address, pmd_t *pmd,
					int referenced)
{
	int swapped_in = 0, ret = 0;
	struct vm_fault vmf = {
		.vma = vma,
		.address = address,
		.flags = FAULT_FLAG_ALLOW_RETRY,
		.pmd = pmd,
		.pgoff = linear_page_index(vma, address),
	};

	/* we only decide to swapin, if there is enough young ptes */
	if (referenced < HPAGE_PMD_NR/2) {
		trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
		return false;
	}
	vmf.pte = pte_offset_map(pmd, address);
	for (; vmf.address < address + HPAGE_PMD_NR*PAGE_SIZE;
			vmf.pte++, vmf.address += PAGE_SIZE) {
		vmf.orig_pte = *vmf.pte;
		if (!is_swap_pte(vmf.orig_pte))
			continue;
		swapped_in++;
		ret = do_swap_page(&vmf);

		/* do_swap_page returns VM_FAULT_RETRY with released mmap_sem */
		if (ret & VM_FAULT_RETRY) {
			down_read(&mm->mmap_sem);
			if (hugepage_vma_revalidate(mm, address, &vmf.vma)) {
				/* vma is no longer available, don't continue to swapin */
				trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
				return false;
			}
			/* check if the pmd is still valid */
			if (mm_find_pmd(mm, address) != pmd) {
				trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
				return false;
			}
		}
		if (ret & VM_FAULT_ERROR) {
			trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
			return false;
		}
		/* pte is unmapped now, we need to map it */
		vmf.pte = pte_offset_map(pmd, vmf.address);
	}
	vmf.pte--;
	pte_unmap(vmf.pte);
	trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 1);
	return true;
}

static void collapse_huge_page(struct mm_struct *mm,
				   unsigned long address,
				   struct page **hpage,
				   int node, int referenced)
{
	pmd_t *pmd, _pmd;
	pte_t *pte;
	pgtable_t pgtable;
	struct page *new_page;
	spinlock_t *pmd_ptl, *pte_ptl;
	int isolated = 0, result = 0;
	struct mem_cgroup *memcg;
	struct vm_area_struct *vma;
	unsigned long mmun_start;	/* For mmu_notifiers */
	unsigned long mmun_end;		/* For mmu_notifiers */
	gfp_t gfp;

	VM_BUG_ON(address & ~HPAGE_PMD_MASK);

	/* Only allocate from the target node */
	gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;

	/*
	 * Before allocating the hugepage, release the mmap_sem read lock.
	 * The allocation can take potentially a long time if it involves
	 * sync compaction, and we do not need to hold the mmap_sem during
	 * that. We will recheck the vma after taking it again in write mode.
	 */
	up_read(&mm->mmap_sem);
	new_page = khugepaged_alloc_page(hpage, gfp, node);
	if (!new_page) {
		result = SCAN_ALLOC_HUGE_PAGE_FAIL;
		goto out_nolock;
	}

	/* Do not oom kill for khugepaged charges */
	if (unlikely(mem_cgroup_try_charge(new_page, mm, gfp | __GFP_NORETRY,
					   &memcg, true))) {
		result = SCAN_CGROUP_CHARGE_FAIL;
		goto out_nolock;
	}

	down_read(&mm->mmap_sem);
	result = hugepage_vma_revalidate(mm, address, &vma);
	if (result) {
		mem_cgroup_cancel_charge(new_page, memcg, true);
		up_read(&mm->mmap_sem);
		goto out_nolock;
	}

	pmd = mm_find_pmd(mm, address);
	if (!pmd) {
		result = SCAN_PMD_NULL;
		mem_cgroup_cancel_charge(new_page, memcg, true);
		up_read(&mm->mmap_sem);
		goto out_nolock;
	}

	/*
	 * __collapse_huge_page_swapin always returns with mmap_sem locked.
	 * If it fails, we release mmap_sem and jump out_nolock.
	 * Continuing to collapse causes inconsistency.
	 */
	if (!__collapse_huge_page_swapin(mm, vma, address, pmd, referenced)) {
		mem_cgroup_cancel_charge(new_page, memcg, true);
		up_read(&mm->mmap_sem);
		goto out_nolock;
	}

	up_read(&mm->mmap_sem);
	/*
	 * Prevent all access to pagetables with the exception of
	 * gup_fast later handled by the ptep_clear_flush and the VM
	 * handled by the anon_vma lock + PG_lock.
	 */
	down_write(&mm->mmap_sem);
	result = hugepage_vma_revalidate(mm, address, &vma);
	if (result)
		goto out;
	/* check if the pmd is still valid */
	if (mm_find_pmd(mm, address) != pmd)
		goto out;

	anon_vma_lock_write(vma->anon_vma);

	pte = pte_offset_map(pmd, address);
	pte_ptl = pte_lockptr(mm, pmd);

	mmun_start = address;
	mmun_end = address + HPAGE_PMD_SIZE;
	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
	/*
	 * After this gup_fast can't run anymore. This also removes
	 * any huge TLB entry from the CPU so we won't allow
	 * huge and small TLB entries for the same virtual address
	 * to avoid the risk of CPU bugs in that area.
	 */
	_pmd = pmdp_collapse_flush(vma, address, pmd);
	spin_unlock(pmd_ptl);
	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
	tlb_remove_table_sync_one();

	spin_lock(pte_ptl);
	isolated = __collapse_huge_page_isolate(vma, address, pte);
	spin_unlock(pte_ptl);

	if (unlikely(!isolated)) {
		pte_unmap(pte);
		spin_lock(pmd_ptl);
		BUG_ON(!pmd_none(*pmd));
		/*
		 * We can only use set_pmd_at when establishing
		 * hugepmds and never for establishing regular pmds that
		 * points to regular pagetables. Use pmd_populate for that
		 */
		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
		spin_unlock(pmd_ptl);
		anon_vma_unlock_write(vma->anon_vma);
		result = SCAN_FAIL;
		goto out;
	}

	/*
	 * All pages are isolated and locked so anon_vma rmap
	 * can't run anymore.
	 */
	anon_vma_unlock_write(vma->anon_vma);

	__collapse_huge_page_copy(pte, new_page, vma, address, pte_ptl);
	pte_unmap(pte);
	__SetPageUptodate(new_page);
	pgtable = pmd_pgtable(_pmd);

	_pmd = mk_huge_pmd(new_page, vma->vm_page_prot);
	_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);

	/*
	 * spin_lock() below is not the equivalent of smp_wmb(), so
	 * this is needed to avoid the copy_huge_page writes to become
	 * visible after the set_pmd_at() write.
	 */
	smp_wmb();

	spin_lock(pmd_ptl);
	BUG_ON(!pmd_none(*pmd));
	page_add_new_anon_rmap(new_page, vma, address, true);
	mem_cgroup_commit_charge(new_page, memcg, false, true);
	lru_cache_add_active_or_unevictable(new_page, vma);
	pgtable_trans_huge_deposit(mm, pmd, pgtable);
	set_pmd_at(mm, address, pmd, _pmd);
	update_mmu_cache_pmd(vma, address, pmd);
	spin_unlock(pmd_ptl);

	*hpage = NULL;

	khugepaged_pages_collapsed++;
	result = SCAN_SUCCEED;
out_up_write:
	up_write(&mm->mmap_sem);
out_nolock:
	trace_mm_collapse_huge_page(mm, isolated, result);
	return;
out:
	mem_cgroup_cancel_charge(new_page, memcg, true);
	goto out_up_write;
}

static int khugepaged_scan_pmd(struct mm_struct *mm,
			       struct vm_area_struct *vma,
			       unsigned long address,
			       struct page **hpage)
{
	pmd_t *pmd;
	pte_t *pte, *_pte;
	int ret = 0, none_or_zero = 0, result = 0, referenced = 0;
	struct page *page = NULL;
	unsigned long _address;
	spinlock_t *ptl;
	int node = NUMA_NO_NODE, unmapped = 0;
	bool writable = false;

	VM_BUG_ON(address & ~HPAGE_PMD_MASK);

	pmd = mm_find_pmd(mm, address);
	if (!pmd) {
		result = SCAN_PMD_NULL;
		goto out;
	}

	memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
	for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
	     _pte++, _address += PAGE_SIZE) {
		pte_t pteval = *_pte;
		if (is_swap_pte(pteval)) {
			if (++unmapped <= khugepaged_max_ptes_swap) {
				continue;
			} else {
				result = SCAN_EXCEED_SWAP_PTE;
				goto out_unmap;
			}
		}
		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
			if (!userfaultfd_armed(vma) &&
			    ++none_or_zero <= khugepaged_max_ptes_none) {
				continue;
			} else {
				result = SCAN_EXCEED_NONE_PTE;
				goto out_unmap;
			}
		}
		if (!pte_present(pteval)) {
			result = SCAN_PTE_NON_PRESENT;
			goto out_unmap;
		}
		if (pte_write(pteval))
			writable = true;

		page = vm_normal_page(vma, _address, pteval);
		if (unlikely(!page)) {
			result = SCAN_PAGE_NULL;
			goto out_unmap;
		}

		/* TODO: teach khugepaged to collapse THP mapped with pte */
		if (PageCompound(page)) {
			result = SCAN_PAGE_COMPOUND;
			goto out_unmap;
		}

		/*
		 * Record which node the original page is from and save this
		 * information to khugepaged_node_load[].
		 * Khugepaged will allocate hugepage from the node that has
		 * the max hit record.
		 */
		node = page_to_nid(page);
		if (khugepaged_scan_abort(node)) {
			result = SCAN_SCAN_ABORT;
			goto out_unmap;
		}
		khugepaged_node_load[node]++;
		if (!PageLRU(page)) {
			result = SCAN_PAGE_LRU;
			goto out_unmap;
		}
		if (PageLocked(page)) {
			result = SCAN_PAGE_LOCK;
			goto out_unmap;
		}
		if (!PageAnon(page)) {
			result = SCAN_PAGE_ANON;
			goto out_unmap;
		}

		/*
		 * cannot use mapcount: can't collapse if there's a gup pin.
		 * The page must only be referenced by the scanned process
		 * and page swap cache.
		 */
		if (page_count(page) != 1 + PageSwapCache(page)) {
			result = SCAN_PAGE_COUNT;
			goto out_unmap;
		}
		if (pte_young(pteval) ||
		    page_is_young(page) || PageReferenced(page) ||
		    mmu_notifier_test_young(vma->vm_mm, address))
			referenced++;
	}
	if (writable) {
		if (referenced) {
			result = SCAN_SUCCEED;
			ret = 1;
		} else {
			result = SCAN_LACK_REFERENCED_PAGE;
		}
	} else {
		result = SCAN_PAGE_RO;
	}
out_unmap:
	pte_unmap_unlock(pte, ptl);
	if (ret) {
		node = khugepaged_find_target_node();
		/* collapse_huge_page will return with the mmap_sem released */
		collapse_huge_page(mm, address, hpage, node, referenced);
	}
out:
	trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
				     none_or_zero, result, unmapped);
	return ret;
}
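The first two checks in the khugepaged_scan_pmd() loop cap how many swap entries and how many empty or zero-page entries a candidate range may contain. The sketch below models only those two counters in user space; enum pte_kind, enum scan_verdict and scan_limits() are hypothetical, the limits are passed in rather than read from khugepaged_max_ptes_swap and khugepaged_max_ptes_none, and the userfaultfd_armed() condition is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

enum pte_kind { PTE_NONE, PTE_SWAP, PTE_PRESENT };	/* hypothetical model */
enum scan_verdict { SCAN_OK, EXCEED_NONE, EXCEED_SWAP };

/*
 * Walk the modelled ptes in order and stop at the first entry that pushes
 * either counter past its limit, mirroring the early exits in the scan loop.
 */
static enum scan_verdict scan_limits(const enum pte_kind *ptes, size_t n,
				     int max_none, int max_swap)
{
	int none = 0, swap = 0;
	size_t i;

	assert(ptes || n == 0);

	for (i = 0; i < n; i++) {
		if (ptes[i] == PTE_SWAP && ++swap > max_swap)
			return EXCEED_SWAP;
		if (ptes[i] == PTE_NONE && ++none > max_none)
			return EXCEED_NONE;
	}
	return SCAN_OK;
}
```

Because the walk stops at the first violation, the verdict depends on entry order as well as on the totals, which is also true of the loop it models.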

static void collect_mm_slot(struct mm_slot *mm_slot)
{
	struct mm_struct *mm = mm_slot->mm;

	VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&khugepaged_mm_lock));

	if (khugepaged_test_exit(mm)) {
		/* free mm_slot */
		hash_del(&mm_slot->hash);
		list_del(&mm_slot->mm_node);

		/*
		 * Not strictly needed because the mm exited already.
		 *
		 * clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
		 */

		/* khugepaged_mm_lock actually not necessary for the below */
		free_mm_slot(mm_slot);
		mmdrop(mm);
	}
}

#if defined(CONFIG_SHMEM) && defined(CONFIG_TRANSPARENT_HUGE_PAGECACHE)
static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
{
	struct vm_area_struct *vma;
khugepaged: retract_page_tables() remember to test exit
commit 18e77600f7a1ed69f8ce46c9e11cad0985712dfa upstream.
Only once have I seen this scenario (and forgot even to notice what forced
the eventual crash): a sequence of "BUG: Bad page map" alerts from
vm_normal_page(), from zap_pte_range() servicing exit_mmap();
pmd:00000000, pte values corresponding to data in physical page 0.
The pte mappings being zapped in this case were supposed to be from a huge
page of ext4 text (but could as well have been shmem): my belief is that
it was racing with collapse_file()'s retract_page_tables(), found *pmd
pointing to a page table, locked it, but *pmd had become 0 by the time
start_pte was decided.
In most cases, that possibility is excluded by holding mmap lock; but
exit_mmap() proceeds without mmap lock. Most of what's run by khugepaged
checks khugepaged_test_exit() after acquiring mmap lock:
khugepaged_collapse_pte_mapped_thps() and hugepage_vma_revalidate() do so,
for example. But retract_page_tables() did not: fix that.
The fix is for retract_page_tables() to check khugepaged_test_exit(),
after acquiring mmap lock, before doing anything to the page table.
Getting the mmap lock serializes with __mmput(), which briefly takes and
drops it in __khugepaged_exit(); then the khugepaged_test_exit() check on
mm_users makes sure we don't touch the page table once exit_mmap() might
reach it, since exit_mmap() will be proceeding without mmap lock, not
expecting anyone to be racing with it.
Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: <stable@vger.kernel.org> [4.8+]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008021215400.27773@eggly.anvils
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-08-06 23:26:22 -07:00
	struct mm_struct *mm;
	unsigned long addr;
	pmd_t *pmd, _pmd;

	i_mmap_lock_write(mapping);
	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
		/* probably overkill */
		if (vma->anon_vma)
			continue;
		addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
		if (addr & ~HPAGE_PMD_MASK)
			continue;
		if (vma->vm_end < addr + HPAGE_PMD_SIZE)
			continue;
		mm = vma->vm_mm;
		pmd = mm_find_pmd(mm, addr);
		if (!pmd)
			continue;
		/*
		 * We need exclusive mmap_sem to retract page table.
		 * If trylock fails we would end up with pte-mapped THP after
		 * re-fault. Not ideal, but it's more important to not disturb
		 * the system too much.
		 */
|
|
|
if (down_write_trylock(&mm->mmap_sem)) {
|
|
|
|
if (!khugepaged_test_exit(mm)) {
|
2022-11-25 22:37:14 +01:00
|
|
|
spinlock_t *ptl;
|
|
|
|
unsigned long end = addr + HPAGE_PMD_SIZE;
|
|
|
|
|
|
|
|
mmu_notifier_invalidate_range_start(mm, addr,
|
|
|
|
end);
|
|
|
|
ptl = pmd_lock(mm, pmd);
|
2020-08-06 23:26:22 -07:00
|
|
|
/* assume page table is clear */
|
|
|
|
_pmd = pmdp_collapse_flush(vma, addr, pmd);
|
|
|
|
spin_unlock(ptl);
|
|
|
|
atomic_long_dec(&mm->nr_ptes);
|
2022-11-25 22:37:13 +01:00
|
|
|
tlb_remove_table_sync_one();
|
2020-08-06 23:26:22 -07:00
|
|
|
pte_free(mm, pmd_pgtable(_pmd));
|
2022-11-25 22:37:14 +01:00
|
|
|
mmu_notifier_invalidate_range_end(mm, addr,
|
|
|
|
end);
|
2020-08-06 23:26:22 -07:00
|
|
|
}
|
|
|
|
up_write(&mm->mmap_sem);
|
2016-07-26 15:26:32 -07:00
|
|
|
}
|
|
|
|
}
|
|
|
|
i_mmap_unlock_write(mapping);
|
|
|
|
}
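The locking pattern in retract_page_tables() above (try the lock, then test for exit before touching the page table) can be modelled in userspace. This is only a sketch under stated assumptions: struct fake_mm, fake_trylock(), fake_test_exit() and fake_retract() are hypothetical names, a C11 atomic_flag stands in for the mmap_sem rwsem, and an integer counter stands in for mm_users. None of this is kernel API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct mm_struct. */
struct fake_mm {
    atomic_flag lock;       /* plays the role of mmap_sem (really an rwsem) */
    atomic_int users;       /* plays the role of mm_users */
    int tables_retracted;   /* counts how often the "page table" was touched */
};

static void fake_mm_init(struct fake_mm *mm, int users)
{
    atomic_flag_clear(&mm->lock);
    atomic_init(&mm->users, users);
    mm->tables_retracted = 0;
}

/* Non-blocking lock attempt: true if we now hold the lock. */
static bool fake_trylock(struct fake_mm *mm)
{
    return !atomic_flag_test_and_set(&mm->lock);
}

static void fake_unlock(struct fake_mm *mm)
{
    atomic_flag_clear(&mm->lock);
}

/* Models the idea of khugepaged_test_exit(): true once no users remain. */
static bool fake_test_exit(struct fake_mm *mm)
{
    return atomic_load(&mm->users) == 0;
}

/* Returns true only if the "page table" was actually retracted. */
static bool fake_retract(struct fake_mm *mm)
{
    bool done = false;

    if (!fake_trylock(mm))
        return false;           /* contended: skip rather than wait */

    if (!fake_test_exit(mm)) {  /* exit check made only once the lock is held */
        mm->tables_retracted++;
        done = true;
    }
    fake_unlock(mm);
    return done;
}
```

The point of the ordering is that the exit test is meaningful only after the lock has been taken; testing first and locking afterwards would leave a window in which the owner could start tearing down.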
|
|
|
|
|
|
|
|
/**
|
|
|
|
* collapse_shmem - collapse small tmpfs/shmem pages into a huge one.
|
|
|
|
*
|
|
|
|
* Basic scheme is simple, details are more complex:
|
mm/khugepaged: collapse_shmem() without freezing new_page
commit 87c460a0bded56195b5eb497d44709777ef7b415 upstream.
khugepaged's collapse_shmem() does almost all of its work, to assemble
the huge new_page from 512 scattered old pages, with the new_page's
refcount frozen to 0 (and refcounts of all old pages so far also frozen
to 0). Including shmem_getpage() to read in any which were out on swap,
memory reclaim if necessary to allocate their intermediate pages, and
copying over all the data from old to new.
Imagine the frozen refcount as a spinlock held, but without any lock
debugging to highlight the abuse: it's not good, and under serious load
heads into lockups - speculative getters of the page are not expecting
to spin while khugepaged is rescheduled.
One can get a little further under load by hacking around elsewhere; but
fortunately, freezing the new_page turns out to have been entirely
unnecessary, with no hacks needed elsewhere.
The huge new_page lock is already held throughout, and guards all its
subpages as they are brought one by one into the page cache tree; and
anything reading the data in that page, without the lock, before it has
been marked PageUptodate, would already be in the wrong. So simply
eliminate the freezing of the new_page.
Each of the old pages remains frozen with refcount 0 after it has been
replaced by a new_page subpage in the page cache tree, until they are
all unfrozen on success or failure: just as before. They could be
unfrozen sooner, but cause no problem once no longer visible to
find_get_entry(), filemap_map_pages() and other speculative lookups.
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261527570.2275@eggly.anvils
Fixes: f3f0e1d2150b2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org> [4.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-11-30 14:10:43 -08:00
|
|
|
* - allocate and lock a new huge page;
|
2016-07-26 15:26:32 -07:00
|
|
|
* - scan over radix tree replacing old pages with the new one
|
|
|
|
* + swap in pages if necessary;
|
|
|
|
* + fill in gaps;
|
|
|
|
* + keep old pages around in case rollback is required;
|
|
|
|
* - if replacing succeeds:
|
|
|
|
* + copy data over;
|
|
|
|
* + free old pages;
|
2018-11-30 14:10:43 -08:00
|
|
|
* + unlock huge page;
|
2016-07-26 15:26:32 -07:00
|
|
|
* - if replacing fails:
|
|
|
|
* + put all pages back and unfreeze them;
|
|
|
|
* + restore gaps in the radix-tree;
|
2018-11-30 14:10:43 -08:00
|
|
|
* + unlock and free huge page;
|
2016-07-26 15:26:32 -07:00
|
|
|
*/
|
|
|
|
static void collapse_shmem(struct mm_struct *mm,
|
|
|
|
struct address_space *mapping, pgoff_t start,
|
|
|
|
struct page **hpage, int node)
|
|
|
|
{
|
|
|
|
gfp_t gfp;
|
|
|
|
struct page *page, *new_page, *tmp;
|
|
|
|
struct mem_cgroup *memcg;
|
|
|
|
pgoff_t index, end = start + HPAGE_PMD_NR;
|
|
|
|
LIST_HEAD(pagelist);
|
|
|
|
struct radix_tree_iter iter;
|
|
|
|
void **slot;
|
|
|
|
int nr_none = 0, result = SCAN_SUCCEED;
|
|
|
|
|
|
|
|
VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
|
|
|
|
|
|
|
|
/* Only allocate from the target node */
|
2017-01-10 16:57:42 -08:00
|
|
|
gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
|
2016-07-26 15:26:32 -07:00
|
|
|
|
|
|
|
new_page = khugepaged_alloc_page(hpage, gfp, node);
|
|
|
|
if (!new_page) {
|
|
|
|
result = SCAN_ALLOC_HUGE_PAGE_FAIL;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2018-03-22 16:17:45 -07:00
|
|
|
/* Do not oom kill for khugepaged charges */
|
|
|
|
if (unlikely(mem_cgroup_try_charge(new_page, mm, gfp | __GFP_NORETRY,
|
|
|
|
&memcg, true))) {
|
2016-07-26 15:26:32 -07:00
|
|
|
result = SCAN_CGROUP_CHARGE_FAIL;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2018-11-30 14:10:39 -08:00
|
|
|
__SetPageLocked(new_page);
|
|
|
|
__SetPageSwapBacked(new_page);
|
2016-07-26 15:26:32 -07:00
|
|
|
new_page->index = start;
|
|
|
|
new_page->mapping = mapping;
|
|
|
|
|
|
|
|
/*
|
2018-11-30 14:10:43 -08:00
|
|
|
* At this point the new_page is locked and not up-to-date.
|
|
|
|
* It's safe to insert it into the page cache, because nobody would
|
|
|
|
* be able to map it or use it in another way until we unlock it.
|
2016-07-26 15:26:32 -07:00
|
|
|
*/
|
|
|
|
|
|
|
|
index = start;
|
|
|
|
spin_lock_irq(&mapping->tree_lock);
|
|
|
|
radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
|
|
|
|
int n = min(iter.index, end) - index;
|
|
|
|
|
2018-11-30 14:10:25 -08:00
|
|
|
/*
|
|
|
|
* Stop if extent has been hole-punched, and is now completely
|
|
|
|
* empty (the more obvious i_size_read() check would take an
|
|
|
|
* irq-unsafe seqlock on 32-bit).
|
|
|
|
*/
|
|
|
|
if (n >= HPAGE_PMD_NR) {
|
|
|
|
result = SCAN_TRUNCATED;
|
|
|
|
goto tree_locked;
|
|
|
|
}
|
|
|
|
|
2016-07-26 15:26:32 -07:00
|
|
|
/*
|
|
|
|
* Handle holes in the radix tree: charge it from shmem and
|
|
|
|
* insert relevant subpage of new_page into the radix-tree.
|
|
|
|
*/
|
|
|
|
if (n && !shmem_charge(mapping->host, n)) {
|
|
|
|
result = SCAN_FAIL;
|
2018-11-30 14:10:39 -08:00
|
|
|
goto tree_locked;
|
2016-07-26 15:26:32 -07:00
|
|
|
}
|
|
|
|
for (; index < min(iter.index, end); index++) {
|
|
|
|
radix_tree_insert(&mapping->page_tree, index,
|
|
|
|
new_page + (index % HPAGE_PMD_NR));
|
|
|
|
}
|
2018-11-30 14:10:39 -08:00
|
|
|
nr_none += n;
|
2016-07-26 15:26:32 -07:00
|
|
|
|
|
|
|
/* We are done. */
|
|
|
|
if (index >= end)
|
|
|
|
break;
|
|
|
|
|
|
|
|
page = radix_tree_deref_slot_protected(slot,
|
|
|
|
&mapping->tree_lock);
|
|
|
|
if (radix_tree_exceptional_entry(page) || !PageUptodate(page)) {
|
|
|
|
spin_unlock_irq(&mapping->tree_lock);
|
|
|
|
/* swap in or instantiate fallocated page */
|
|
|
|
if (shmem_getpage(mapping->host, index, &page,
|
|
|
|
SGP_NOHUGE)) {
|
|
|
|
result = SCAN_FAIL;
|
|
|
|
goto tree_unlocked;
|
|
|
|
}
|
|
|
|
} else if (trylock_page(page)) {
|
|
|
|
get_page(page);
|
2018-11-30 14:10:39 -08:00
|
|
|
spin_unlock_irq(&mapping->tree_lock);
|
2016-07-26 15:26:32 -07:00
|
|
|
} else {
|
|
|
|
result = SCAN_PAGE_LOCK;
|
2018-11-30 14:10:39 -08:00
|
|
|
goto tree_locked;
|
2016-07-26 15:26:32 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The page must be locked, so we can drop the tree_lock
|
|
|
|
* without racing with truncate.
|
|
|
|
*/
|
|
|
|
VM_BUG_ON_PAGE(!PageLocked(page), page);
|
|
|
|
VM_BUG_ON_PAGE(!PageUptodate(page), page);
|
2018-11-30 14:10:47 -08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If file was truncated then extended, or hole-punched, before
|
|
|
|
* we locked the first page, then a THP might be there already.
|
|
|
|
*/
|
|
|
|
if (PageTransCompound(page)) {
|
|
|
|
result = SCAN_PAGE_COMPOUND;
|
|
|
|
goto out_unlock;
|
|
|
|
}
|
2016-07-26 15:26:32 -07:00
|
|
|
|
|
|
|
if (page_mapping(page) != mapping) {
|
|
|
|
result = SCAN_TRUNCATED;
|
|
|
|
goto out_unlock;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (isolate_lru_page(page)) {
|
|
|
|
result = SCAN_DEL_PAGE_LRU;
|
2018-11-30 14:10:39 -08:00
|
|
|
goto out_unlock;
|
2016-07-26 15:26:32 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
if (page_mapped(page))
|
|
|
|
unmap_mapping_range(mapping, index << PAGE_SHIFT,
|
|
|
|
PAGE_SIZE, 0);
|
|
|
|
|
|
|
|
spin_lock_irq(&mapping->tree_lock);
|
|
|
|
|
2016-12-12 16:43:32 -08:00
|
|
|
slot = radix_tree_lookup_slot(&mapping->page_tree, index);
|
|
|
|
VM_BUG_ON_PAGE(page != radix_tree_deref_slot_protected(slot,
|
|
|
|
&mapping->tree_lock), page);
|
2016-07-26 15:26:32 -07:00
|
|
|
VM_BUG_ON_PAGE(page_mapped(page), page);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The page is expected to have page_count() == 3:
|
|
|
|
* - we hold a pin on it;
|
|
|
|
* - one reference from radix tree;
|
|
|
|
* - one from isolate_lru_page;
|
|
|
|
*/
|
|
|
|
if (!page_ref_freeze(page, 3)) {
|
|
|
|
result = SCAN_PAGE_COUNT;
|
2018-11-30 14:10:39 -08:00
|
|
|
spin_unlock_irq(&mapping->tree_lock);
|
|
|
|
putback_lru_page(page);
|
|
|
|
goto out_unlock;
|
2016-07-26 15:26:32 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Add the page to the list to be able to undo the collapse if
|
|
|
|
* something goes wrong.
|
|
|
|
*/
|
|
|
|
list_add_tail(&page->lru, &pagelist);
|
|
|
|
|
|
|
|
/* Finally, replace with the new page. */
|
2016-12-12 16:43:43 -08:00
|
|
|
radix_tree_replace_slot(&mapping->page_tree, slot,
|
2016-07-26 15:26:32 -07:00
|
|
|
new_page + (index % HPAGE_PMD_NR));
|
|
|
|
|
2016-12-14 15:08:49 -08:00
|
|
|
slot = radix_tree_iter_resume(slot, &iter);
|
2016-07-26 15:26:32 -07:00
|
|
|
index++;
|
|
|
|
continue;
|
|
|
|
out_unlock:
|
|
|
|
unlock_page(page);
|
|
|
|
put_page(page);
|
2018-11-30 14:10:39 -08:00
|
|
|
goto tree_unlocked;
|
2016-07-26 15:26:32 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Handle hole in radix tree at the end of the range.
|
|
|
|
* This code only triggers if there's nothing in radix tree
|
|
|
|
* beyond 'end'.
|
|
|
|
*/
|
2018-11-30 14:10:39 -08:00
|
|
|
if (index < end) {
|
2016-07-26 15:26:32 -07:00
|
|
|
int n = end - index;
|
|
|
|
|
2018-11-30 14:10:25 -08:00
|
|
|
/* Stop if extent has been truncated, and is now empty */
|
|
|
|
if (n >= HPAGE_PMD_NR) {
|
|
|
|
result = SCAN_TRUNCATED;
|
|
|
|
goto tree_locked;
|
|
|
|
}
|
2016-07-26 15:26:32 -07:00
|
|
|
if (!shmem_charge(mapping->host, n)) {
|
|
|
|
result = SCAN_FAIL;
|
|
|
|
goto tree_locked;
|
|
|
|
}
|
|
|
|
for (; index < end; index++) {
|
|
|
|
radix_tree_insert(&mapping->page_tree, index,
|
|
|
|
new_page + (index % HPAGE_PMD_NR));
|
|
|
|
}
|
|
|
|
nr_none += n;
|
|
|
|
}
|
|
|
|
|
2018-11-30 14:10:39 -08:00
|
|
|
__inc_node_page_state(new_page, NR_SHMEM_THPS);
|
|
|
|
if (nr_none) {
|
|
|
|
struct zone *zone = page_zone(new_page);
|
|
|
|
|
|
|
|
__mod_node_page_state(zone->zone_pgdat, NR_FILE_PAGES, nr_none);
|
|
|
|
__mod_node_page_state(zone->zone_pgdat, NR_SHMEM, nr_none);
|
|
|
|
}
|
|
|
|
|
2016-07-26 15:26:32 -07:00
|
|
|
tree_locked:
|
|
|
|
spin_unlock_irq(&mapping->tree_lock);
|
|
|
|
tree_unlocked:
|
|
|
|
|
|
|
|
if (result == SCAN_SUCCEED) {
|
|
|
|
/*
|
|
|
|
* Replacing old pages with the new one has succeeded; now we need to
|
|
|
|
* copy the content and free old pages.
|
|
|
|
*/
|
2018-11-30 14:10:35 -08:00
|
|
|
index = start;
|
2016-07-26 15:26:32 -07:00
|
|
|
list_for_each_entry_safe(page, tmp, &pagelist, lru) {
|
2018-11-30 14:10:35 -08:00
|
|
|
while (index < page->index) {
|
|
|
|
clear_highpage(new_page + (index % HPAGE_PMD_NR));
|
|
|
|
index++;
|
|
|
|
}
|
2016-07-26 15:26:32 -07:00
|
|
|
copy_highpage(new_page + (page->index % HPAGE_PMD_NR),
|
|
|
|
page);
|
|
|
|
list_del(&page->lru);
|
|
|
|
page->mapping = NULL;
|
2018-11-30 14:10:39 -08:00
|
|
|
page_ref_unfreeze(page, 1);
|
2016-07-26 15:26:32 -07:00
|
|
|
ClearPageActive(page);
|
|
|
|
ClearPageUnevictable(page);
|
2018-11-30 14:10:39 -08:00
|
|
|
unlock_page(page);
|
2016-07-26 15:26:32 -07:00
|
|
|
put_page(page);
|
2018-11-30 14:10:35 -08:00
|
|
|
index++;
|
|
|
|
}
|
|
|
|
while (index < end) {
|
|
|
|
clear_highpage(new_page + (index % HPAGE_PMD_NR));
|
|
|
|
index++;
|
2016-07-26 15:26:32 -07:00
|
|
|
}
|
|
|
|
|
|
|
|
SetPageUptodate(new_page);
|
2018-11-30 14:10:43 -08:00
|
|
|
page_ref_add(new_page, HPAGE_PMD_NR - 1);
|
2018-11-30 14:10:39 -08:00
|
|
|
set_page_dirty(new_page);
|
2016-07-26 15:26:32 -07:00
|
|
|
mem_cgroup_commit_charge(new_page, memcg, false, true);
|
|
|
|
lru_cache_add_anon(new_page);
|
|
|
|
|
2018-11-30 14:10:39 -08:00
|
|
|
/*
|
|
|
|
* Remove pte page tables, so we can re-fault the page as huge.
|
|
|
|
*/
|
|
|
|
retract_page_tables(mapping, start);
|
2016-07-26 15:26:32 -07:00
|
|
|
*hpage = NULL;
|
|
|
|
} else {
|
|
|
|
/* Something went wrong: rollback changes to the radix-tree */
|
|
|
|
spin_lock_irq(&mapping->tree_lock);
|
2018-11-30 14:10:29 -08:00
|
|
|
mapping->nrpages -= nr_none;
|
|
|
|
shmem_uncharge(mapping->host, nr_none);
|
|
|
|
|
2016-07-26 15:26:32 -07:00
|
|
|
radix_tree_for_each_slot(slot, &mapping->page_tree, &iter,
|
|
|
|
start) {
|
|
|
|
if (iter.index >= end)
|
|
|
|
break;
|
|
|
|
page = list_first_entry_or_null(&pagelist,
|
|
|
|
struct page, lru);
|
|
|
|
if (!page || iter.index < page->index) {
|
|
|
|
if (!nr_none)
|
|
|
|
break;
|
|
|
|
nr_none--;
|
2016-12-12 16:43:35 -08:00
|
|
|
/* Put holes back where they were */
|
|
|
|
radix_tree_delete(&mapping->page_tree,
|
|
|
|
iter.index);
|
2016-07-26 15:26:32 -07:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
VM_BUG_ON_PAGE(page->index != iter.index, page);
|
|
|
|
|
|
|
|
/* Unfreeze the page. */
|
|
|
|
list_del(&page->lru);
|
|
|
|
page_ref_unfreeze(page, 2);
|
2016-12-12 16:43:43 -08:00
|
|
|
radix_tree_replace_slot(&mapping->page_tree,
|
|
|
|
slot, page);
|
2016-12-14 15:08:49 -08:00
|
|
|
slot = radix_tree_iter_resume(slot, &iter);
|
2016-07-26 15:26:32 -07:00
|
|
|
spin_unlock_irq(&mapping->tree_lock);
|
|
|
|
unlock_page(page);
|
2018-11-30 14:10:39 -08:00
|
|
|
putback_lru_page(page);
|
2016-07-26 15:26:32 -07:00
|
|
|
spin_lock_irq(&mapping->tree_lock);
|
|
|
|
}
|
|
|
|
VM_BUG_ON(nr_none);
|
|
|
|
spin_unlock_irq(&mapping->tree_lock);
|
|
|
|
|
|
|
|
mem_cgroup_cancel_charge(new_page, memcg, true);
|
|
|
|
new_page->mapping = NULL;
|
|
|
|
}
|
2018-11-30 14:10:39 -08:00
|
|
|
|
|
|
|
unlock_page(new_page);
|
2016-07-26 15:26:32 -07:00
|
|
|
out:
|
|
|
|
VM_BUG_ON(!list_empty(&pagelist));
|
|
|
|
/* TODO: tracepoints */
|
|
|
|
}
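collapse_shmem() above freezes each old page with page_ref_freeze(page, 3) and later releases it with page_ref_unfreeze(). The idea, as far as it can be read from the code and the commit message, is a compare-and-swap of the reference count from the expected value to 0, so that speculative getters back off while the page is frozen. A minimal userspace sketch, assuming C11 atomics in place of the kernel's atomic_t; struct obj, ref_freeze(), ref_unfreeze() and ref_get_unless_zero() are hypothetical names for illustration:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page: only the reference count. */
struct obj {
    atomic_int refcount;
};

/*
 * Freeze: atomically drop the count to 0, but only if it currently equals
 * the number of references the caller can account for.
 */
static bool ref_freeze(struct obj *o, int expected)
{
    int old = expected;

    return atomic_compare_exchange_strong(&o->refcount, &old, 0);
}

/* Unfreeze: publish a new count; only valid while the object is frozen. */
static void ref_unfreeze(struct obj *o, int count)
{
    assert(atomic_load(&o->refcount) == 0);
    atomic_store(&o->refcount, count);
}

/* Speculative getter: take a reference unless the object is frozen. */
static bool ref_get_unless_zero(struct obj *o)
{
    int old = atomic_load(&o->refcount);

    while (old != 0) {
        /* On failure, old is reloaded with the current count. */
        if (atomic_compare_exchange_weak(&o->refcount, &old, old + 1))
            return true;
    }
    return false;
}
```

In this model a failed ref_freeze() leaves the count untouched, which matches the SCAN_PAGE_COUNT path above where the page is simply put back. The unfreeze value differs between the success path (1) and the rollback path (2) because the page has a different set of holders in each case.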
|
|
|
|
|
|
|
|
static void khugepaged_scan_shmem(struct mm_struct *mm,
		struct address_space *mapping,
		pgoff_t start, struct page **hpage)
{
	struct page *page = NULL;
	struct radix_tree_iter iter;
	void **slot;
	int present, swap;
	int node = NUMA_NO_NODE;
	int result = SCAN_SUCCEED;

	present = 0;
	swap = 0;
	memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
	rcu_read_lock();
	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
		if (iter.index >= start + HPAGE_PMD_NR)
			break;

		page = radix_tree_deref_slot(slot);
		if (radix_tree_deref_retry(page)) {
			slot = radix_tree_iter_retry(&iter);
			continue;
		}

		if (radix_tree_exception(page)) {
			if (++swap > khugepaged_max_ptes_swap) {
				result = SCAN_EXCEED_SWAP_PTE;
				break;
			}
			continue;
		}

		if (PageTransCompound(page)) {
			result = SCAN_PAGE_COMPOUND;
			break;
		}

		node = page_to_nid(page);
		if (khugepaged_scan_abort(node)) {
			result = SCAN_SCAN_ABORT;
			break;
		}
		khugepaged_node_load[node]++;

		if (!PageLRU(page)) {
			result = SCAN_PAGE_LRU;
			break;
		}

		if (page_count(page) != 1 + page_mapcount(page)) {
			result = SCAN_PAGE_COUNT;
			break;
		}

		/*
		 * We probably should check if the page is referenced here, but
		 * nobody would transfer pte_young() to PageReferenced() for us.
		 * And rmap walk here is just too costly...
		 */

		present++;

		if (need_resched()) {
			slot = radix_tree_iter_resume(slot, &iter);
			cond_resched_rcu();
		}
	}
	rcu_read_unlock();

	if (result == SCAN_SUCCEED) {
		if (present < HPAGE_PMD_NR - khugepaged_max_ptes_none) {
			result = SCAN_EXCEED_NONE_PTE;
		} else {
			node = khugepaged_find_target_node();
			collapse_shmem(mm, mapping, start, hpage, node);
		}
	}

	/* TODO: tracepoints */
}
#else
static void khugepaged_scan_shmem(struct mm_struct *mm,
		struct address_space *mapping,
		pgoff_t start, struct page **hpage)
{
	BUILD_BUG();
}
#endif

static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
					    struct page **hpage)
	__releases(&khugepaged_mm_lock)
	__acquires(&khugepaged_mm_lock)
{
	struct mm_slot *mm_slot;
	struct mm_struct *mm;
	struct vm_area_struct *vma;
	int progress = 0;

	VM_BUG_ON(!pages);
	VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&khugepaged_mm_lock));

	if (khugepaged_scan.mm_slot)
		mm_slot = khugepaged_scan.mm_slot;
	else {
		mm_slot = list_entry(khugepaged_scan.mm_head.next,
				     struct mm_slot, mm_node);
		khugepaged_scan.address = 0;
		khugepaged_scan.mm_slot = mm_slot;
	}
	spin_unlock(&khugepaged_mm_lock);

	mm = mm_slot->mm;
	/*
	 * Don't wait for semaphore (to avoid long wait times). Just move to
	 * the next mm on the list.
	 */
	vma = NULL;
	if (unlikely(!down_read_trylock(&mm->mmap_sem)))
		goto breakouterloop_mmap_sem;
	if (likely(!khugepaged_test_exit(mm)))
		vma = find_vma(mm, khugepaged_scan.address);

	progress++;
	for (; vma; vma = vma->vm_next) {
		unsigned long hstart, hend;

		cond_resched();
		if (unlikely(khugepaged_test_exit(mm))) {
			progress++;
			break;
		}
		if (!hugepage_vma_check(vma)) {
skip:
			progress++;
			continue;
		}
		hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
		hend = vma->vm_end & HPAGE_PMD_MASK;
		if (hstart >= hend)
			goto skip;
		if (khugepaged_scan.address > hend)
			goto skip;
		if (khugepaged_scan.address < hstart)
			khugepaged_scan.address = hstart;
		VM_BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK);

		while (khugepaged_scan.address < hend) {
			int ret;
			cond_resched();
			if (unlikely(khugepaged_test_exit(mm)))
				goto breakouterloop;

			VM_BUG_ON(khugepaged_scan.address < hstart ||
				  khugepaged_scan.address + HPAGE_PMD_SIZE >
				  hend);
			if (shmem_file(vma->vm_file)) {
				struct file *file;
				pgoff_t pgoff = linear_page_index(vma,
						khugepaged_scan.address);
				if (!shmem_huge_enabled(vma))
					goto skip;
				file = get_file(vma->vm_file);
				up_read(&mm->mmap_sem);
				ret = 1;
				khugepaged_scan_shmem(mm, file->f_mapping,
						pgoff, hpage);
				fput(file);
			} else {
				ret = khugepaged_scan_pmd(mm, vma,
						khugepaged_scan.address,
						hpage);
			}
			/* move to next address */
			khugepaged_scan.address += HPAGE_PMD_SIZE;
			progress += HPAGE_PMD_NR;
			if (ret)
				/* we released mmap_sem so break loop */
				goto breakouterloop_mmap_sem;
			if (progress >= pages)
				goto breakouterloop;
		}
	}
breakouterloop:
	up_read(&mm->mmap_sem); /* exit_mmap will destroy ptes after this */
breakouterloop_mmap_sem:

	spin_lock(&khugepaged_mm_lock);
	VM_BUG_ON(khugepaged_scan.mm_slot != mm_slot);
	/*
	 * Release the current mm_slot if this mm is about to die, or
	 * if we scanned all vmas of this mm.
	 */
	if (khugepaged_test_exit(mm) || !vma) {
		/*
		 * Make sure that if mm_users is reaching zero while
		 * khugepaged runs here, khugepaged_exit will find
		 * mm_slot not pointing to the exiting mm.
		 */
		if (mm_slot->mm_node.next != &khugepaged_scan.mm_head) {
			khugepaged_scan.mm_slot = list_entry(
				mm_slot->mm_node.next,
				struct mm_slot, mm_node);
			khugepaged_scan.address = 0;
		} else {
			khugepaged_scan.mm_slot = NULL;
			khugepaged_full_scans++;
		}

		collect_mm_slot(mm_slot);
	}

	return progress;
}

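The hstart/hend arithmetic in khugepaged_scan_mm_slot() rounds a VMA's start up, and its end down, to a huge-page boundary, and skips the VMA when no aligned extent fits between them. The following is a minimal userspace sketch of that arithmetic only. It assumes a 2 MiB PMD-sized huge page (PMD shift of 21, as on x86-64 with 4 KiB base pages); the DEMO_* macros and demo_* helpers are hypothetical names introduced for illustration and are not part of the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: 2 MiB huge pages (PMD shift 21); the real value is per-architecture. */
#define DEMO_HPAGE_PMD_SHIFT 21
#define DEMO_HPAGE_PMD_SIZE  (1UL << DEMO_HPAGE_PMD_SHIFT)
#define DEMO_HPAGE_PMD_MASK  (~(DEMO_HPAGE_PMD_SIZE - 1))

/* Round vm_start up to a huge-page boundary (same expression as hstart). */
static unsigned long demo_hstart(unsigned long vm_start)
{
	return (vm_start + ~DEMO_HPAGE_PMD_MASK) & DEMO_HPAGE_PMD_MASK;
}

/* Round vm_end down to a huge-page boundary (same expression as hend). */
static unsigned long demo_hend(unsigned long vm_end)
{
	return vm_end & DEMO_HPAGE_PMD_MASK;
}

/* True when [vm_start, vm_end) contains at least one aligned huge-page extent. */
static bool demo_range_has_hpage(unsigned long vm_start, unsigned long vm_end)
{
	assert(vm_start <= vm_end);
	return demo_hstart(vm_start) < demo_hend(vm_end);
}
```

Under these assumptions, a range from 0x201000 to 0x7ff000 yields the aligned window 0x400000 to 0x600000, while a range from 0x201000 to 0x3ff000 has no aligned window and would take the skip path.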
static int khugepaged_has_work(void)
{
	return !list_empty(&khugepaged_scan.mm_head) &&
		khugepaged_enabled();
}

static int khugepaged_wait_event(void)
{
	return !list_empty(&khugepaged_scan.mm_head) ||
		kthread_should_stop();
}

static void khugepaged_do_scan(void)
{
	struct page *hpage = NULL;
	unsigned int progress = 0, pass_through_head = 0;
	unsigned int pages = khugepaged_pages_to_scan;
	bool wait = true;

	barrier(); /* write khugepaged_pages_to_scan to local stack */

	while (progress < pages) {
		if (!khugepaged_prealloc_page(&hpage, &wait))
			break;

		cond_resched();

		if (unlikely(kthread_should_stop() || try_to_freeze()))
			break;

		spin_lock(&khugepaged_mm_lock);
		if (!khugepaged_scan.mm_slot)
			pass_through_head++;
		if (khugepaged_has_work() &&
		    pass_through_head < 2)
			progress += khugepaged_scan_mm_slot(pages - progress,
							    &hpage);
		else
			progress = pages;
		spin_unlock(&khugepaged_mm_lock);
	}

	if (!IS_ERR_OR_NULL(hpage))
		put_page(hpage);
}

static bool khugepaged_should_wakeup(void)
{
	return kthread_should_stop() ||
	       time_after_eq(jiffies, khugepaged_sleep_expire);
}

static void khugepaged_wait_work(void)
{
	if (khugepaged_has_work()) {
		const unsigned long scan_sleep_jiffies =
			msecs_to_jiffies(khugepaged_scan_sleep_millisecs);

		if (!scan_sleep_jiffies)
			return;

		khugepaged_sleep_expire = jiffies + scan_sleep_jiffies;
		wait_event_freezable_timeout(khugepaged_wait,
					     khugepaged_should_wakeup(),
					     scan_sleep_jiffies);
		return;
	}

	if (khugepaged_enabled())
		wait_event_freezable(khugepaged_wait, khugepaged_wait_event());
}

static int khugepaged(void *none)
{
	struct mm_slot *mm_slot;

	set_freezable();
	set_user_nice(current, MAX_NICE);

	while (!kthread_should_stop()) {
		khugepaged_do_scan();
		khugepaged_wait_work();
	}

	spin_lock(&khugepaged_mm_lock);
	mm_slot = khugepaged_scan.mm_slot;
	khugepaged_scan.mm_slot = NULL;
	if (mm_slot)
		collect_mm_slot(mm_slot);
	spin_unlock(&khugepaged_mm_lock);
	return 0;
}

static void set_recommended_min_free_kbytes(void)
{
	struct zone *zone;
	int nr_zones = 0;
	unsigned long recommended_min;

	for_each_populated_zone(zone)
		nr_zones++;

	/* Ensure 2 pageblocks are free to assist fragmentation avoidance */
	recommended_min = pageblock_nr_pages * nr_zones * 2;

	/*
	 * Make sure that on average at least two pageblocks are almost free
	 * of another type, one for a migratetype to fall back to and a
	 * second to avoid subsequent fallbacks of other types. There are 3
	 * MIGRATE_TYPES we care about.
	 */
	recommended_min += pageblock_nr_pages * nr_zones *
			   MIGRATE_PCPTYPES * MIGRATE_PCPTYPES;

	/* never reserve more than 5% of lowmem */
	recommended_min = min(recommended_min,
			      (unsigned long) nr_free_buffer_pages() / 20);
	recommended_min <<= (PAGE_SHIFT-10);

	if (recommended_min > min_free_kbytes) {
		if (user_min_free_kbytes >= 0)
			pr_info("raising min_free_kbytes from %d to %lu to help transparent hugepage allocations\n",
				min_free_kbytes, recommended_min);

		min_free_kbytes = recommended_min;
	}
	setup_per_zone_wmarks();
}

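The arithmetic in set_recommended_min_free_kbytes() can be followed with concrete numbers. The sketch below is a userspace restatement of the calculation only, not kernel code: demo_recommended_min_kbytes() and DEMO_MIGRATE_PCPTYPES are hypothetical names, the kernel globals are passed in as plain parameters, and the value 3 for MIGRATE_PCPTYPES is taken from the "3 MIGRATE_TYPES" comment in the function.

```c
#include <assert.h>

/* Assumption: three per-cpu migrate types, per the comment in the kernel function. */
#define DEMO_MIGRATE_PCPTYPES 3UL

/*
 * Hypothetical helper: returns the recommended min_free_kbytes value in KiB.
 * pageblock_nr_pages, nr_zones and free_buffer_pages stand in for the kernel's
 * pageblock_nr_pages, the populated-zone count and nr_free_buffer_pages().
 */
static unsigned long demo_recommended_min_kbytes(unsigned long pageblock_nr_pages,
						 unsigned long nr_zones,
						 unsigned long free_buffer_pages,
						 unsigned int page_shift)
{
	unsigned long pages, cap;

	/* The final shift converts pages to KiB, so a page must be >= 1 KiB. */
	assert(page_shift >= 10);

	/* Two free pageblocks per populated zone. */
	pages = pageblock_nr_pages * nr_zones * 2;

	/* Plus MIGRATE_PCPTYPES squared pageblocks per zone for fallbacks. */
	pages += pageblock_nr_pages * nr_zones *
		 DEMO_MIGRATE_PCPTYPES * DEMO_MIGRATE_PCPTYPES;

	/* Never reserve more than 5% (1/20) of the free buffer pages. */
	cap = free_buffer_pages / 20;
	if (pages > cap)
		pages = cap;

	/* Convert a page count to KiB. */
	return pages << (page_shift - 10);
}
```

With 512-page pageblocks, 3 populated zones, 4,000,000 free buffer pages and 4 KiB pages, the uncapped total is 512 * 3 * (2 + 9) = 16896 pages, or 67584 KiB. With only 100,000 free buffer pages the 5% cap applies and the result is 5000 pages, or 20000 KiB.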
int start_stop_khugepaged(void)
{
	int err = 0;

	mutex_lock(&khugepaged_mutex);
	if (khugepaged_enabled()) {
		if (!khugepaged_thread)
			khugepaged_thread = kthread_run(khugepaged, NULL,
							"khugepaged");
		if (IS_ERR(khugepaged_thread)) {
			pr_err("khugepaged: kthread_run(khugepaged) failed\n");
			err = PTR_ERR(khugepaged_thread);
			khugepaged_thread = NULL;
			goto fail;
		}

		if (!list_empty(&khugepaged_scan.mm_head))
			wake_up_interruptible(&khugepaged_wait);

		set_recommended_min_free_kbytes();
	} else if (khugepaged_thread) {
		kthread_stop(khugepaged_thread);
		khugepaged_thread = NULL;
	}
fail:
	mutex_unlock(&khugepaged_mutex);
	return err;
}

void khugepaged_min_free_kbytes_update(void)
{
	mutex_lock(&khugepaged_mutex);
	if (khugepaged_enabled() && khugepaged_thread)
		set_recommended_min_free_kbytes();
	mutex_unlock(&khugepaged_mutex);
}