Provide infrastructure to do a speculative fault (not holding
mmap_sem).
The not holding of mmap_sem means we can race against VMA
change/removal and page-table destruction. We use the SRCU VMA freeing
to keep the VMA around. We use the VMA seqcount to detect change
(including umapping / page-table deletion) and we use gup_fast() style
page-table walking to deal with page-table races.
Once we've obtained the page and are ready to update the PTE, we
validate if the state we started the fault with is still valid, if
not, we'll fail the fault with VM_FAULT_RETRY, otherwise we update the
PTE and we're done.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[Manage the newly introduced pte_spinlock() for speculative page
fault to fail if the VMA is touched in our back]
[Rename vma_is_dead() to vma_has_changed() and declare it here]
[Fetch p4d and pud]
[Set vmd.sequence in __handle_mm_fault()]
[Abort speculative path when handle_userfault() has to be called]
[Add additional VMA's flags checks in handle_speculative_fault()]
[Clear FAULT_FLAG_ALLOW_RETRY in handle_speculative_fault()]
[Don't set vmf->pte and vmf->ptl if pte_map_lock() failed]
[Remove warning comment about waiting for !seq&1 since we don't want
to wait]
[Remove warning about no huge page support, mention it explictly]
[Don't call do_fault() in the speculative path as __do_fault() calls
vma->vm_ops->fault() which may want to release mmap_sem]
[Only vm_fault pointer argument for vma_has_changed()]
[Fix check against huge page, calling pmd_trans_huge()]
[Use READ_ONCE() when reading VMA's fields in the speculative path]
[Explicitly check for __HAVE_ARCH_PTE_SPECIAL as we can't support for
processing done in vm_normal_page()]
[Check that vma->anon_vma is already set when starting the speculative
path]
[Check for memory policy as we can't support MPOL_INTERLEAVE case due to
the processing done in mpol_misplaced()]
[Don't support VMA growing up or down]
[Move check on vm_sequence just before calling handle_pte_fault()]
[Don't build SPF services if !CONFIG_SPECULATIVE_PAGE_FAULT]
[Add mem cgroup oom check]
[Use READ_ONCE to access p*d entries]
[Replace deprecated ACCESS_ONCE() by READ_ONCE() in vma_has_changed()]
[Don't fetch pte again in handle_pte_fault() when running the speculative
path]
[Check PMD against concurrent collapsing operation]
[Try spin lock the pte during the speculative path to avoid deadlock with
other CPU's invalidating the TLB and requiring this CPU to catch the
inter processor's interrupt]
[Move define of FAULT_FLAG_SPECULATIVE here]
[Introduce __handle_speculative_fault() and add a check against
mm->mm_users in handle_speculative_fault() defined in mm.h]
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Change-Id: I45cbe79a96c7aa234cace421a36550e7fb0b9368
Patch-mainline: linux-mm @ Tue, 17 Apr 2018 16:33:24
[vinmenon@codeaurora.org: fix !CONFIG_NUMA build +
checkpatch fixes]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
This change is inspired by the Peter's proposal patch [1] which was
protecting the VMA using SRCU. Unfortunately, SRCU is not scaling well in
that particular case, and it is introducing major performance degradation
due to excessive scheduling operations.
To allow access to the mm_rb tree without grabbing the mmap_sem, this patch
is protecting it access using a rwlock. As the mm_rb tree is a O(log n)
search it is safe to protect it using such a lock. The VMA cache is not
protected by the new rwlock and it should not be used without holding the
mmap_sem.
To allow the picked VMA structure to be used once the rwlock is released, a
use count is added to the VMA structure. When the VMA is allocated it is
set to 1. Each time the VMA is picked with the rwlock held its use count
is incremented. Each time the VMA is released it is decremented. When the
use count hits zero, this means that the VMA is no more used and should be
freed.
This patch is preparing for 2 kind of VMA access :
- as usual, under the control of the mmap_sem,
- without holding the mmap_sem for the speculative page fault handler.
Access done under the control the mmap_sem doesn't require to grab the
rwlock to protect read access to the mm_rb tree, but access in write must
be done under the protection of the rwlock too. This affects inserting and
removing of elements in the RB tree.
The patch is introducing 2 new functions:
- vma_get() to find a VMA based on an address by holding the new rwlock.
- vma_put() to release the VMA when its no more used.
These services are designed to be used when access are made to the RB tree
without holding the mmap_sem.
When a VMA is removed from the RB tree, its vma->vm_rb field is cleared and
we rely on the WMB done when releasing the rwlock to serialize the write
with the RMB done in a later patch to check for the VMA's validity.
When free_vma is called, the file associated with the VMA is closed
immediately, but the policy and the file structure remained in used until
the VMA's use count reach 0, which may happens later when exiting an
in progress speculative page fault.
[1] https://patchwork.kernel.org/patch/5108281/
Change-Id: I9ecc922b8efa4b28975cc6a8e9531284c24ac14e
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Patch-mainline: linux-mm @ Tue, 17 Apr 2018 16:33:23
[vinmenon@codeaurora.org: fix the return of put_vma]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
When dealing with speculative page fault handler, we may race with VMA
being split or merged. In this case the vma->vm_start and vm->vm_end
fields may not match the address the page fault is occurring.
This can only happens when the VMA is split but in that case, the
anon_vma pointer of the new VMA will be the same as the original one,
because in __split_vma the new->anon_vma is set to src->anon_vma when
*new = *vma.
So even if the VMA boundaries are not correct, the anon_vma pointer is
still valid.
If the VMA has been merged, then the VMA in which it has been merged
must have the same anon_vma pointer otherwise the merge can't be done.
So in all the case we know that the anon_vma is valid, since we have
checked before starting the speculative page fault that the anon_vma
pointer is valid for this VMA and since there is an anon_vma this
means that at one time a page has been backed and that before the VMA
is cleaned, the page table lock would have to be grab to clean the
PTE, and the anon_vma field is checked once the PTE is locked.
This patch introduce a new __page_add_new_anon_rmap() service which
doesn't check for the VMA boundaries, and create a new inline one
which do the check.
When called from a page fault handler, if this is not a speculative one,
there is a guarantee that vm_start and vm_end match the faulting address,
so this check is useless. In the context of the speculative page fault
handler, this check may be wrong but anon_vma is still valid as explained
above.
Change-Id: I72c47830181579f8c9618df879077d321653b5f1
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Patch-mainline: linux-mm @ Tue, 17 Apr 2018 16:33:22
[vinmenon@codeaurora.org: trivial merge conflict fixes +
checkpatch fixes]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
When dealing with the speculative fault path we should use the VMA's field
cached value stored in the vm_fault structure.
Currently vm_normal_page() is using the pointer to the VMA to fetch the
vm_flags value. This patch provides a new __vm_normal_page() which is
receiving the vm_flags flags value as parameter.
Note: The speculative path is turned on for architecture providing support
for special PTE flag. So only the first block of vm_normal_page is used
during the speculative path.
Change-Id: I0f2c4ab1212fbca449bdf6e7993dafa0d41044bc
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Patch-mainline: linux-mm @ Tue, 17 Apr 2018 16:33:21
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
The speculative page fault handler which is run without holding the
mmap_sem is calling lru_cache_add_active_or_unevictable() but the vm_flags
is not guaranteed to remain constant.
Introducing __lru_cache_add_active_or_unevictable() which has the vma flags
value parameter instead of the vma pointer.
Change-Id: I68decbe0f80847403127c45c97565e47512532e9
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Patch-mainline: linux-mm @ Tue, 17 Apr 2018 16:33:20
[vinmenon@codeaurora.org: trivial merge conflict fixes]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
migrate_misplaced_page() is only called during the page fault handling so
it's better to pass the pointer to the struct vm_fault instead of the vma.
This way during the speculative page fault path the saved vma->vm_flags
could be used.
Change-Id: I254a7c9d91dca9ee8a9afd5eccd6de9af5dc8bc0
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Patch-mainline: linux-mm @ Tue, 17 Apr 2018 16:33:19
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
When handling speculative page fault, the vma->vm_flags and
vma->vm_page_prot fields are read once the page table lock is released. So
there is no more guarantee that these fields would not change in our back.
They will be saved in the vm_fault structure before the VMA is checked for
changes.
This patch also set the fields in hugetlb_no_page() and
__collapse_huge_page_swapin even if it is not need for the callee.
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Change-Id: I9821f02ea32ef220b57b8bfd817992bbf71bbb1d
Patch-mainline: linux-mm @ Tue, 17 Apr 2018 16:33:18
[vinmenon@codeaurora.org: trivial merge conflict fixes]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
The speculative page fault handler must be protected against anon_vma
changes. This is because page_add_new_anon_rmap() is called during the
speculative path.
In addition, don't try speculative page fault if the VMA don't have an
anon_vma structure allocated because its allocation should be
protected by the mmap_sem.
In __vma_adjust() when importer->anon_vma is set, there is no need to
protect against speculative page faults since speculative page fault
is aborted if the vma->anon_vma is not set.
When calling page_add_new_anon_rmap() vma->anon_vma is necessarily
valid since we checked for it when locking the pte and the anon_vma is
removed once the pte is unlocked. So even if the speculative page
fault handler is running concurrently with do_unmap(), as the pte is
locked in unmap_region() - through unmap_vmas() - and the anon_vma
unlinked later, because we check for the vma sequence counter which is
updated in unmap_page_range() before locking the pte, and then in
free_pgtables() so when locking the pte the change will be detected.
Change-Id: I6c1f3b5c811d1ddd7b3f769082e8bbd40f5b52a0
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Patch-mainline: linux-mm @ Tue, 17 Apr 2018 16:33:17
[vinmenon@codeaurora.org: trivial merge conflict fixes]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
If a thread is remapping an area while another one is faulting on the
destination area, the SPF handler may fetch the vma from the RB tree before
the pte has been moved by the other thread. This means that the moved ptes
will overwrite those create by the page fault handler leading to page
leaked.
CPU 1 CPU2
enter mremap()
unmap the dest area
copy_vma() Enter speculative page fault handler
>> at this time the dest area is present in the RB tree
fetch the vma matching dest area
create a pte as the VMA matched
Exit the SPF handler
<data written in the new page>
move_ptes()
> it is assumed that the dest area is empty,
> the move ptes overwrite the page mapped by the CPU2.
To prevent that, when the VMA matching the dest area is extended or created
by copy_vma(), it should be marked as non available to the SPF handler.
The usual way to so is to rely on vm_write_begin()/end().
This is already in __vma_adjust() called by copy_vma() (through
vma_merge()). But __vma_adjust() is calling vm_write_end() before returning
which create a window for another thread.
This patch adds a new parameter to vma_merge() which is passed down to
vma_adjust().
The assumption is that copy_vma() is returning a vma which should be
released by calling vm_raw_write_end() by the callee once the ptes have
been moved.
Change-Id: Icd338ad6e9b3c97b7334d3b8d30a8badfa2a4efa
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Patch-mainline: linux-mm @ Tue, 17 Apr 2018 16:33:16
[vinmenon@codeaurora.org: changes in vma_merge arguments related
to the anon vma user name which is not suppported upstream.]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
The VMA sequence count has been introduced to allow fast detection of
VMA modification when running a page fault handler without holding
the mmap_sem.
This patch provides protection against the VMA modification done in :
- madvise()
- mpol_rebind_policy()
- vma_replace_policy()
- change_prot_numa()
- mlock(), munlock()
- mprotect()
- mmap_region()
- collapse_huge_page()
- userfaultd registering services
In addition, VMA fields which will be read during the speculative fault
path needs to be written using WRITE_ONCE to prevent write to be split
and intermediate values to be pushed to other CPUs.
Change-Id: Ic36046b7254e538b6baf7144c50ae577ee7f2074
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Patch-mainline: linux-mm @ Tue, 17 Apr 2018 16:33:15
[vinmenon@codeaurora.org: trivial merge conflict fixes]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Wrap the VMA modifications (vma_adjust/unmap_page_range) with sequence
counts such that we can easily test if a VMA is changed.
The unmap_page_range() one allows us to make assumptions about
page-tables; when we find the seqcount hasn't changed we can assume
page-tables are still valid.
The flip side is that we cannot distinguish between a vma_adjust() and
the unmap_page_range() -- where with the former we could have
re-checked the vma bounds against the address.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[Port to 4.12 kernel]
[Build depends on CONFIG_SPECULATIVE_PAGE_FAULT]
[Introduce vm_write_* inline function depending on
CONFIG_SPECULATIVE_PAGE_FAULT]
[Fix lock dependency between mapping->i_mmap_rwsem and vma->vm_sequence by
using vm_raw_write* functions]
[Fix a lock dependency warning in mmap_region() when entering the error
path]
[move sequence initialisation INIT_VMA()]
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Change-Id: Ibc23ef3b9dbb80323c0f24cb06da34b4c3a8fa71
Patch-mainline: linux-mm @ 17 Apr 2018 16:33:14
[vinmenon@codeaurora.org: trivial merge conflict fixes]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Some VMA struct fields need to be initialized once the VMA structure is
allocated.
Currently this only concerns anon_vma_chain field but some other will be
added to support the speculative page fault.
Instead of spreading the initialization calls all over the code, let's
introduce a dedicated inline function.
Change-Id: I9f6b29dc74055354318b548e2b6b22c37d4c61bb
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Patch-mainline: linux-mm @ Tue, 17 Apr 2018 16:33:13
[vinmenon@codeaurora.org: trivial merge conflict fixes]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
pte_unmap_same() is making the assumption that the page table are still
around because the mmap_sem is held.
This is no more the case when running a speculative page fault and
additional check must be made to ensure that the final page table are still
there.
This is now done by calling pte_spinlock() to check for the VMA's
consistency while locking for the page tables.
This is requiring passing a vm_fault structure to pte_unmap_same() which is
containing all the needed parameters.
As pte_spinlock() may fail in the case of a speculative page fault, if the
VMA has been touched in our back, pte_unmap_same() should now return 3
cases :
1. pte are the same (0)
2. pte are different (VM_FAULT_PTNOTSAME)
3. a VMA's changes has been detected (VM_FAULT_RETRY)
The case 2 is handled by the introduction of a new VM_FAULT flag named
VM_FAULT_PTNOTSAME which is then trapped in cow_user_page().
If VM_FAULT_RETRY is returned, it is passed up to the callers to retry the
page fault while holding the mmap_sem.
Change-Id: Iaccfa0d877334f4343f8b0ec3400af5070ff5864
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Patch-mainline: linux-mm @ Tue, 17 Apr 2018 16:33:12
[vinmenon@codeaurora.org: trivial merge conflicts]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
When handling page fault without holding the mmap_sem the fetch of the
pte lock pointer and the locking will have to be done while ensuring
that the VMA is not touched in our back.
So move the fetch and locking operations in a dedicated function.
Change-Id: If93ab95b1d22b7195e1c15b57315021f6be7c394
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Patch-mainline: linux-mm @ Tue, 17 Apr 2018 16:33:11
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
When speculating faults (without holding mmap_sem) we need to validate
that the vma against which we loaded pages is still valid when we're
ready to install the new PTE.
Therefore, replace the pte_offset_map_lock() calls that (re)take the
PTL with pte_map_lock() which can fail in case we find the VMA changed
since we started the fault.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[Port to 4.12 kernel]
[Remove the comment about the fault_env structure which has been
implemented as the vm_fault structure in the kernel]
[move pte_map_lock()'s definition upper in the file]
[move the define of FAULT_FLAG_SPECULATIVE later in the series]
[review error path in do_swap_page(), do_anonymous_page() and
wp_page_copy()]
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Change-Id: Id6dfae130fbfdd4bb92aa6415d6f1db7ef833266
[vinmenon@codeaurora.org: fix trivial merge conflicts]
Patch-mainline: linux-mm @ Tue, 17 Apr 2018 16:33:10
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
This configuration variable will be used to build the code needed to
handle speculative page fault.
By default it is turned off, and activated depending on architecture
support, SMP and MMU.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: David Rientjes <rientjes@google.com>
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Change-Id: I17123124ec0667a0f0af741a740e5219d278620c
Patch-mainline: linux-mm @ Tue, 17 Apr 2018 16:33:07
[vinmenon@codeaurora.org: trivial merge conflict fixes]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>