184 Commits

Author SHA1 Message Date
Mel Gorman
9fdece4b65
mm, page_alloc: spread allocations across zones before introducing fragmentation
Patch series "Fragmentation avoidance improvements", v5.

It has been noted before that fragmentation avoidance (aka
anti-fragmentation) is not perfect. Given sufficient time or an adverse
workload, memory gets fragmented and the long-term success of high-order
allocations degrades. This series defines an adverse workload, a definition
of external fragmentation events (including serious) ones and a series
that reduces the level of those fragmentation events.

The details of the workload and the consequences are described in more
detail in the changelogs. However, from patch 1, this is a high-level
summary of the adverse workload. The exact details are found in the
mmtests implementation.

The broad details of the workload are as follows;

1. Create an XFS filesystem (not specified in the configuration but done
   as part of the testing for this patch)
2. Start 4 fio threads that write a number of 64K files inefficiently.
   Inefficiently means that files are created on first access and not
   created in advance (fio parameterr create_on_open=1) and fallocate
   is not used (fallocate=none). With multiple IO issuers this creates
   a mix of slab and page cache allocations over time. The total size
   of the files is 150% physical memory so that the slabs and page cache
   pages get mixed
3. Warm up a number of fio read-only threads accessing the same files
   created in step 2. This part runs for the same length of time it
   took to create the files. It'll fault back in old data and further
   interleave slab and page cache allocations. As it's now low on
   memory due to step 2, fragmentation occurs as pageblocks get
   stolen.
4. While step 3 is still running, start a process that tries to allocate
   75% of memory as huge pages with a number of threads. The number of
   threads is based on a (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
   threads contending with fio, any other threads or forcing cross-NUMA
   scheduling. Note that the test has not been used on a machine with less
   than 8 cores. The benchmark records whether huge pages were allocated
   and what the fault latency was in microseconds
5. Measure the number of events potentially causing external fragmentation,
   the fault latency and the huge page allocation success rate.
6. Cleanup

Overall the series reduces external fragmentation causing events by over 94%
on 1 and 2 socket machines, which in turn impacts high-order allocation
success rates over the long term. There are differences in latencies and
high-order allocation success rates. Latencies are a mixed bag as they
are vulnerable to exact system state and whether allocations succeeded
so they are treated as a secondary metric.

Patch 1 uses lower zones if they are populated and have free memory
	instead of fragmenting a higher zone. It's special cased to
	handle a Normal->DMA32 fallback with the reasons explained
	in the changelog.

Patch 2-4 boosts watermarks temporarily when an external fragmentation
	event occurs. kswapd wakes to reclaim a small amount of old memory
	and then wakes kcompactd on completion to recover the system
	slightly. This introduces some overhead in the slowpath. The level
	of boosting can be tuned or disabled depending on the tolerance
	for fragmentation vs allocation latency.

Patch 5 stalls some movable allocation requests to let kswapd from patch 4
	make some progress. The duration of the stalls is very low but it
	is possible to tune the system to avoid fragmentation events if
	larger stalls can be tolerated.

The bulk of the improvement in fragmentation avoidance is from patches
1-4 but patch 5 can deal with a rare corner case and provides the option
of tuning a system for THP allocation success rates in exchange for
some stalls to control fragmentation.

This patch (of 5):

The page allocator zone lists are iterated based on the watermarks of each
zone which does not take anti-fragmentation into account.  On x86, node 0
may have multiple zones while other nodes have one zone.  A consequence is
that tasks running on node 0 may fragment ZONE_NORMAL even though
ZONE_DMA32 has plenty of free memory.  This patch special cases the
allocator fast path such that it'll try an allocation from a lower local
zone before fragmenting a higher zone.  In this case, stealing of
pageblocks or orders larger than a pageblock are still allowed in the fast
path as they are uninteresting from a fragmentation point of view.

This was evaluated using a benchmark designed to fragment memory before
attempting THP allocations.  It's implemented in mmtests as the following
configurations

configs/config-global-dhp__workload_thpfioscale
configs/config-global-dhp__workload_thpfioscale-defrag
configs/config-global-dhp__workload_thpfioscale-madvhugepage

e.g. from mmtests
./run-mmtests.sh --run-monitor --config configs/config-global-dhp__workload_thpfioscale test-run-1

The broad details of the workload are as follows;

1. Create an XFS filesystem (not specified in the configuration but done
   as part of the testing for this patch).
2. Start 4 fio threads that write a number of 64K files inefficiently.
   Inefficiently means that files are created on first access and not
   created in advance (fio parameter create_on_open=1) and fallocate
   is not used (fallocate=none). With multiple IO issuers this creates
   a mix of slab and page cache allocations over time. The total size
   of the files is 150% physical memory so that the slabs and page cache
   pages get mixed.
3. Warm up a number of fio read-only processes accessing the same files
   created in step 2. This part runs for the same length of time it
   took to create the files. It'll refault old data and further
   interleave slab and page cache allocations. As it's now low on
   memory due to step 2, fragmentation occurs as pageblocks get
   stolen.
4. While step 3 is still running, start a process that tries to allocate
   75% of memory as huge pages with a number of threads. The number of
   threads is based on a (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
   threads contending with fio, any other threads or forcing cross-NUMA
   scheduling. Note that the test has not been used on a machine with less
   than 8 cores. The benchmark records whether huge pages were allocated
   and what the fault latency was in microseconds.
5. Measure the number of events potentially causing external fragmentation,
   the fault latency and the huge page allocation success rate.
6. Cleanup the test files.

Note that due to the use of IO and page cache that this benchmark is not
suitable for running on large machines where the time to fragment memory
may be excessive.  Also note that while this is one mix that generates
fragmentation that it's not the only mix that generates fragmentation.
Differences in workload that are more slab-intensive or whether SLUB is
used with high-order pages may yield different results.

When the page allocator fragments memory, it records the event using the
mm_page_alloc_extfrag ftrace event.  If the fallback_order is smaller than
a pageblock order (order-9 on 64-bit x86) then it's considered to be an
"external fragmentation event" that may cause issues in the future.
Hence, the primary metric here is the number of external fragmentation
events that occur with order < 9.  The secondary metric is allocation
latency and huge page allocation success rates but note that differences
in latencies and what the success rate also can affect the number of
external fragmentation event which is why it's a secondary metric.

1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------

4.20-rc3 extfrag events < order 9:   804694
4.20-rc3+patch:                      408912 (49% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc3             4.20.0-rc3
                                      vanilla           lowzone-v5r8
Amean     fault-base-1      662.92 (   0.00%)      653.58 *   1.41%*
Amean     fault-huge-1        0.00 (   0.00%)        0.00 (   0.00%)

                              4.20.0-rc3             4.20.0-rc3
                                 vanilla           lowzone-v5r8
Percentage huge-1        0.00 (   0.00%)        0.00 (   0.00%)

Fault latencies are slightly reduced while allocation success rates remain
at zero as this configuration does not make any special effort to allocate
THP and fio is heavily active at the time and either filling memory or
keeping pages resident.  However, a 49% reduction of serious fragmentation
events reduces the changes of external fragmentation being a problem in
the future.

Vlastimil asked during review for a breakdown of the allocation types
that are falling back.

vanilla
   3816 MIGRATE_UNMOVABLE
 800845 MIGRATE_MOVABLE
     33 MIGRATE_UNRECLAIMABLE

patch
    735 MIGRATE_UNMOVABLE
 408135 MIGRATE_MOVABLE
     42 MIGRATE_UNRECLAIMABLE

The majority of the fallbacks are due to movable allocations and this is
consistent for the workload throughout the series so will not be presented
again as the primary source of fallbacks are movable allocations.

Movable fallbacks are sometimes considered "ok" to fallback because they
can be migrated.  The problem is that they can fill an
unmovable/reclaimable pageblock causing those allocations to fallback
later and polluting pageblocks with pages that cannot move.  If there is a
movable fallback, it is pretty much guaranteed to affect an
unmovable/reclaimable pageblock and while it might not be enough to
actually cause a unmovable/reclaimable fallback in the future, we cannot
know that in advance so the patch takes the only option available to it.
Hence, it's important to control them.  This point is also consistent
throughout the series and will not be repeated.

1-socket Skylake machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc3 extfrag events < order 9:  291392
4.20-rc3+patch:                     191187 (34% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc3             4.20.0-rc3
                                      vanilla           lowzone-v5r8
Amean     fault-base-1     1495.14 (   0.00%)     1467.55 (   1.85%)
Amean     fault-huge-1     1098.48 (   0.00%)     1127.11 (  -2.61%)

thpfioscale Percentage Faults Huge
                              4.20.0-rc3             4.20.0-rc3
                                 vanilla           lowzone-v5r8
Percentage huge-1       78.57 (   0.00%)       77.64 (  -1.18%)

Fragmentation events were reduced quite a bit although this is known
to be a little variable. The latencies and allocation success rates
are similar but they were already quite high.

2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads
----------------------------------------------------------------

4.20-rc3 extfrag events < order 9:  215698
4.20-rc3+patch:                     200210 (7% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc3             4.20.0-rc3
                                      vanilla           lowzone-v5r8
Amean     fault-base-5     1350.05 (   0.00%)     1346.45 (   0.27%)
Amean     fault-huge-5     4181.01 (   0.00%)     3418.60 (  18.24%)

                              4.20.0-rc3             4.20.0-rc3
                                 vanilla           lowzone-v5r8
Percentage huge-5        1.15 (   0.00%)        0.78 ( -31.88%)

The reduction of external fragmentation events is slight and this is
partially due to the removal of __GFP_THISNODE in commit ac5b2c18911f
("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings") as THP
allocations can now spill over to remote nodes instead of fragmenting
local memory.

2-socket Haswell machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc3 extfrag events < order 9: 166352
4.20-rc3+patch:                    147463 (11% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc3             4.20.0-rc3
                                      vanilla           lowzone-v5r8
Amean     fault-base-5     6138.97 (   0.00%)     6217.43 (  -1.28%)
Amean     fault-huge-5     2294.28 (   0.00%)     3163.33 * -37.88%*

thpfioscale Percentage Faults Huge
                              4.20.0-rc3             4.20.0-rc3
                                 vanilla           lowzone-v5r8
Percentage huge-5       96.82 (   0.00%)       95.14 (  -1.74%)

There was a slight reduction in external fragmentation events although the
latencies were higher.  The allocation success rate is high enough that
the system is struggling and there is quite a lot of parallel reclaim and
compaction activity.  There is also a certain degree of luck on whether
processes start on node 0 or not for this patch but the relevance is
reduced later in the series.

Overall, the patch reduces the number of external fragmentation causing
events so the success of THP over long periods of time would be improved
for this adverse workload.

Link: http://lkml.kernel.org/r/20181123114528.28802-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

[RealJohnGalt]: adapt for 4.14

Signed-off-by: azrim <mirzaspc@gmail.com>
2022-08-25 14:30:41 +00:00
Edgar Arriaga García
083dbe75d0
BACKPORT: mm: introduce MADV_COLD
When a process expects no accesses to a certain memory range, it could
give a hint to kernel that the pages can be reclaimed when memory pressure
happens but data should be preserved for future use.  This could reduce
workingset eviction so it ends up increasing performance.

This patch introduces the new MADV_COLD hint to madvise(2) syscall.
MADV_COLD can be used by a process to mark a memory range as not expected
to be used in the near future.  The hint can help kernel in deciding which
pages to evict early during memory pressure.

It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves

        active file page -> inactive file LRU
        active anon page -> inacdtive anon LRU

Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file
LRU's head because MADV_COLD is a little bit different symantic.
MADV_FREE means it's okay to discard when the memory pressure because the
content of the page is *garbage* so freeing such pages is almost zero
overhead since we don't need to swap out and access afterward causes just
minor fault.  Thus, it would make sense to put those freeable pages in
inactive file LRU to compete other used-once pages.  It makes sense for
implmentaion point of view, too because it's not swapbacked memory any
longer until it would be re-dirtied.  Even, it could give a bonus to make
them be reclaimed on swapless system.  However, MADV_COLD doesn't mean
garbage so reclaiming them requires swap-out/in in the end so it's bigger
cost.  Since we have designed VM LRU aging based on cost-model, anonymous
cold pages would be better to position inactive anon's LRU list, not file
LRU.  Furthermore, it would help to avoid unnecessary scanning if system
doesn't have a swap device.  Let's start simpler way without adding
complexity at this moment.  However, keep in mind, too that it's a caveat
that workloads with a lot of pages cache are likely to ignore MADV_COLD on
anonymous memory because we rarely age anonymous LRU lists.

* man-page material

MADV_COLD (since Linux x.x)

Pages in the specified regions will be treated as less-recently-accessed
compared to pages in the system with similar access frequencies.  In
contrast to MADV_FREE, the contents of the region are preserved regardless
of subsequent writes to pages.

MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
pages.

[akpm@linux-foundation.org: resolve conflicts with hmm.git]
Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: kbuild test robot <lkp@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Chris Zankel <chris@zankel.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Daniel Colascione <dancol@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oleksandr Natalenko <oleksandr@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Sonny Rao <sonnyrao@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tim Murray <timmurray@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 9c276cc65a58faf98be8e56962745ec99ab87636)

Conflicts:
    include/asm-generic/tlb.h

1. Added tlb_change_page_size function from commit ed6a79352cad00e9a49d6e438be40e45107207bf
which was required by this patch.

2. the lru_deactivate_pvecs and other swap.c changes are skipped as they already
existed in the repo, they were backported in change-id I06fed20103671e4ca6fb8663d5029736442162a5

Bug: 153444106
Test: Built kernel
Signed-off-by: Edgar Arriaga García <edgararriaga@google.com>
Change-Id: I8f0f9d54e2f3d0ffe75c54f4db67e73f60083482
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-07-31 09:36:44 +00:00
Sultan Alsawaf
e8a93fb383
mm: Move slab shrinkers into a dedicated thread called kshrinkd
Slab shrinkers are unique in that they have no bearing on direct reclaim
throttling and reclaim statistics, meaning that they can be completely
decoupled from the kswapd and direct reclaim paths, making them faster.
Further justification for this is that shrinkers may be slow and aren't
guaranteed to free any memory, which means that they can waste kswapd
and direct reclaim time without producing results. Running the same
shrinker concurrently can also produce unwanted lock contention inside
the shrinker itself.

An empirical test of checking how long kswapd and kshrinkd ran on a
memory constrained system showed that kshrinkd ran for about 10% of the
duration that kswapd ran, which indicates significant savings.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-05-03 06:17:10 +07:00
Sultan Alsawaf
e6a815cb94
mm: Don't stop kswapd on a per-node basis when there are no waiters
The page allocator wakes all kswapds in an allocation context's allowed
nodemask in the slow path, so it doesn't make sense to have the kswapd-
waiter count per each NUMA node. Instead, it should be a global counter
to stop all kswapds when there are no failed allocation requests.

Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:20:20 +07:00
Arian
3a330c6445 Merge branch 'android-4.14-stable' of https://android.googlesource.com/kernel/common into HEAD
Change-Id: I714223aa1f97959bd97b6bf758511466c9394bd8
2022-03-16 00:46:24 +01:00
Hugh Dickins
4dfa0d6f48 mm/thp: fix vma_address() if virtual address below file offset
[ Upstream commit 494334e43c16d63b878536a26505397fce6ff3a2 ]

Running certain tests with a DEBUG_VM kernel would crash within hours,
on the total_mapcount BUG() in split_huge_page_to_list(), while trying
to free up some memory by punching a hole in a shmem huge page: split's
try_to_unmap() was unable to find all the mappings of the page (which,
on a !DEBUG_VM kernel, would then keep the huge page pinned in memory).

When that BUG() was changed to a WARN(), it would later crash on the
VM_BUG_ON_VMA(end < vma->vm_start || start >= vma->vm_end, vma) in
mm/internal.h:vma_address(), used by rmap_walk_file() for
try_to_unmap().

vma_address() is usually correct, but there's a wraparound case when the
vm_start address is unusually low, but vm_pgoff not so low:
vma_address() chooses max(start, vma->vm_start), but that decides on the
wrong address, because start has become almost ULONG_MAX.

Rewrite vma_address() to be more careful about vm_pgoff; move the
VM_BUG_ON_VMA() out of it, returning -EFAULT for errors, so that it can
be safely used from page_mapped_in_vma() and page_address_in_vma() too.

Add vma_address_end() to apply similar care to end address calculation,
in page_vma_mapped_walk() and page_mkclean_one() and try_to_unmap_one();
though it raises a question of whether callers would do better to supply
pvmw->end to page_vma_mapped_walk() - I chose not, for a smaller patch.

An irritation is that their apparent generality breaks down on KSM
pages, which cannot be located by the page->index that page_to_pgoff()
uses: as commit 4b0ece6fa016 ("mm: migrate: fix remove_migration_pte()
for ksm pages") once discovered.  I dithered over the best thing to do
about that, and have ended up with a VM_BUG_ON_PAGE(PageKsm) in both
vma_address() and vma_address_end(); though the only place in danger of
using it on them was try_to_unmap_one().

Sidenote: vma_address() and vma_address_end() now use compound_nr() on a
head page, instead of thp_size(): to make the right calculation on a
hugetlbfs page, whether or not THPs are configured.  try_to_unmap() is
used on hugetlbfs pages, but perhaps the wrong calculation never
mattered.

Link: https://lkml.kernel.org/r/caf1c1a3-7cfb-7f8f-1beb-ba816e932825@google.com
Fixes: a8fa41ad2f6f ("mm, rmap: check all VMAs that PTE-mapped THP can be part of")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jue Wang <juew@google.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Wang Yugui <wangyugui@e16-tech.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Note on stable backport: fixed up conflicts on intervening thp_size(),
and mmu_notifier_range initializations; substitute for compound_nr().

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-11 12:48:10 +02:00
Srinivasarao P
f3f0576c22 Merge android-4.14.158 (84afceb) into msm-4.14
* refs/heads/tmp-84afceb:
  Linux 4.14.158
  net: fec: fix clock count mis-match
  platform/x86: hp-wmi: Fix ACPI errors caused by passing 0 as input size
  platform/x86: hp-wmi: Fix ACPI errors caused by too small buffer
  ASoC: stm32: i2s: fix IRQ clearing
  ASoC: stm32: i2s: fix 16 bit format support
  ASoC: stm32: i2s: fix dma configuration
  pinctrl: stm32: fix memory leak issue
  mailbox: mailbox-test: fix null pointer if no mmio
  hwrng: stm32 - fix unbalanced pm_runtime_enable
  media: stm32-dcmi: fix DMA corruption when stopping streaming
  crypto: stm32/hash - Fix hmac issue more than 256 bytes
  HID: core: check whether Usage Page item is after Usage ID items
  futex: Prevent exit livelock
  futex: Provide distinct return value when owner is exiting
  futex: Add mutex around futex exit
  futex: Provide state handling for exec() as well
  futex: Sanitize exit state handling
  futex: Mark the begin of futex exit explicitly
  futex: Set task::futex_state to DEAD right after handling futex exit
  futex: Split futex_mm_release() for exit/exec
  exit/exec: Seperate mm_release()
  futex: Replace PF_EXITPIDONE with a state
  futex: Move futex exit handling into futex code
  futex: Prevent robust futex exit race
  y2038: futex: Move compat implementation into futex.c
  mtd: spi-nor: cast to u64 to avoid uint overflows
  mtd: rawnand: atmel: fix possible object reference leak
  mtd: rawnand: atmel: Fix spelling mistake in error message
  net: macb driver, check for SKBTX_HW_TSTAMP
  net: macb: Fix SUBNS increment and increase resolution
  watchdog: sama5d4: fix WDD value to be always set to max
  ext4: add more paranoia checking in ext4_expand_extra_isize handling
  net: sched: fix `tc -s class show` no bstats on class with nolock subqueues
  sctp: cache netns in sctp_ep_common
  tipc: fix link name length check
  openvswitch: remove another BUG_ON()
  openvswitch: drop unneeded BUG_ON() in ovs_flow_cmd_build_info()
  slip: Fix use-after-free Read in slip_open
  openvswitch: fix flow command message size
  net: psample: fix skb_over_panic
  macvlan: schedule bc_work even if error
  media: atmel: atmel-isc: fix INIT_WORK misplacement
  media: atmel: atmel-isc: fix asd memory allocation
  pwm: Clear chip_data in pwm_put()
  net: macb: fix error format in dev_err()
  media: v4l2-ctrl: fix flags for DO_WHITE_BALANCE
  xfrm: Fix memleak on xfrm state destroy
  mei: bus: prefix device names on bus with the bus name
  USB: serial: ftdi_sio: add device IDs for U-Blox C099-F9P
  staging: rtl8723bs: Add 024c:0525 to the list of SDIO device-ids
  staging: rtl8723bs: Drop ACPI device ids
  staging: rtl8192e: fix potential use after free
  clk: at91: generated: set audio_pll_allowed in at91_clk_register_generated()
  clk: at91: fix update bit maps on CFG_MOR write
  mm, gup: add missing refcount overflow checks on s390
  mtd: Remove a debug trace in mtdpart.c
  powerpc/pseries/dlpar: Fix a missing check in dlpar_parse_cc_property()
  scsi: libsas: Check SMP PHY control function result
  ACPI / APEI: Switch estatus pool to use vmalloc memory
  ACPI / APEI: Don't wait to serialise with oops messages when panic()ing
  scsi: libsas: Support SATA PHY connection rate unmatch fixing during discovery
  apparmor: delete the dentry in aafs_remove() to avoid a leak
  iommu/amd: Fix NULL dereference bug in match_hid_uid
  net: hns3: Change fw error code NOT_EXEC to NOT_SUPPORTED
  bpf: drop refcount if bpf_map_new_fd() fails in map_create()
  kvm: properly check debugfs dentry before using it
  net: dev: Use unsigned integer as an argument to left-shift
  bpf: decrease usercnt if bpf_map_new_fd() fails in bpf_map_get_fd_by_id()
  sctp: don't compare hb_timer expire date before starting it
  net: fix possible overflow in __sk_mem_raise_allocated()
  sfc: initialise found bitmap in efx_ef10_mtd_probe
  tipc: fix skb may be leaky in tipc_link_input
  blktrace: Show requests without sector
  net/smc: prevent races between smc_lgr_terminate() and smc_conn_free()
  decnet: fix DN_IFREQ_SIZE
  ip_tunnel: Make none-tunnel-dst tunnel port work with lwtunnel
  sfc: suppress duplicate nvmem partition types in efx_ef10_mtd_probe
  gpu: ipu-v3: pre: don't trigger update if buffer address doesn't change
  serial: 8250: Fix serial8250 initialization crash
  net/core/neighbour: fix kmemleak minimal reference count for hash tables
  PCI/MSI: Return -ENOSPC from pci_alloc_irq_vectors_affinity()
  net/core/neighbour: tell kmemleak about hash tables
  tipc: fix memory leak in tipc_nl_compat_publ_dump
  mtd: Check add_mtd_device() ret code
  lib/genalloc.c: include vmalloc.h
  drivers/base/platform.c: kmemleak ignore a known leak
  fork: fix some -Wmissing-prototypes warnings
  lib/genalloc.c: use vzalloc_node() to allocate the bitmap
  lib/genalloc.c: fix allocation of aligned buffer from non-aligned chunk
  vmscan: return NODE_RECLAIM_NOSCAN in node_reclaim() when CONFIG_NUMA is n
  ocfs2: clear journal dirty flag after shutdown journal
  net/wan/fsl_ucc_hdlc: Avoid double free in ucc_hdlc_probe()
  tipc: fix a missing check of genlmsg_put
  atl1e: checking the status of atl1e_write_phy_reg
  net: dsa: bcm_sf2: Propagate error value from mdio_write
  net: stmicro: fix a missing check of clk_prepare
  net: (cpts) fix a missing check of clk_prepare
  um: Make GCOV depend on !KCOV
  f2fs: fix to dirty inode synchronously
  net/net_namespace: Check the return value of register_pernet_subsys()
  net/netlink_compat: Fix a missing check of nla_parse_nested
  pwm: clps711x: Fix period calculation
  crypto: mxc-scc - fix build warnings on ARM64
  powerpc/pseries: Fix node leak in update_lmb_associativity_index()
  powerpc/83xx: handle machine check caused by watchdog timer
  regulator: tps65910: fix a missing check of return value
  IB/rxe: Make counters thread safe
  drbd: fix print_st_err()'s prototype to match the definition
  drbd: do not block when adjusting "disk-options" while IO is frozen
  drbd: reject attach of unsuitable uuids even if connected
  drbd: ignore "all zero" peer volume sizes in handshake
  powerpc/powernv/eeh/npu: Fix uninitialized variables in opal_pci_eeh_freeze_status
  vfio/spapr_tce: Get rid of possible infinite loop
  powerpc/44x/bamboo: Fix PCI range
  powerpc/mm: Make NULL pointer deferences explicit on bad page faults.
  powerpc/prom: fix early DEBUG messages
  powerpc/perf: Fix unit_sel/cache_sel checks
  ath6kl: Fix off by one error in scan completion
  ath6kl: Only use match sets when firmware supports it
  scsi: csiostor: fix incorrect dma device in case of vport
  scsi: qla2xxx: deadlock by configfs_depend_item
  RDMA/srp: Propagate ib_post_send() failures to the SCSI mid-layer
  openrisc: Fix broken paths to arch/or32
  serial: max310x: Fix tx_empty() callback
  Bluetooth: hci_bcm: Handle specific unknown packets after firmware loading
  drivers/regulator: fix a missing check of return value
  powerpc/xmon: fix dump_segments()
  powerpc/book3s/32: fix number of bats in p/v_block_mapped()
  vxlan: Fix error path in __vxlan_dev_create()
  clocksource/drivers/fttmr010: Fix invalid interrupt register access
  IB/qib: Fix an error code in qib_sdma_verbs_send()
  xfs: Fix bulkstat compat ioctls on x32 userspace.
  xfs: Align compat attrlist_by_handle with native implementation.
  gfs2: take jdata unstuff into account in do_grow
  dm flakey: Properly corrupt multi-page bios.
  HID: doc: fix wrong data structure reference for UHID_OUTPUT
  pinctrl: sh-pfc: sh7734: Fix shifted values in IPSR10
  pinctrl: sh-pfc: sh7264: Fix PFCR3 and PFCR0 register configuration
  KVM: s390: unregister debug feature on failing arch init
  bnxt_en: query force speeds before disabling autoneg mode.
  bnxt_en: Return linux standard errors in bnxt_ethtool.c
  exofs_mount(): fix leaks on failure exits
  net/mlx5: Continue driver initialization despite debugfs failure
  pinctrl: xway: fix gpio-hog related boot issues
  vfio-mdev/samples: Use u8 instead of char for handle functions
  xen/pciback: Check dev_data before using it
  kprobes/x86/xen: blacklist non-attachable xen interrupt functions
  serial: 8250: Rate limit serial port rx interrupts during input overruns
  HID: intel-ish-hid: fixes incorrect error handling
  btrfs: only track ref_heads in delayed_ref_updates
  mtd: rawnand: sunxi: Write pageprog related opcodes to WCMD_SET
  mmc: meson-gx: make sure the descriptor is stopped on errors
  VSOCK: bind to random port for VMADDR_PORT_ANY
  kvm: vmx: Set IA32_TSC_AUX for legacy mode guests
  gpiolib: Fix return value of gpio_to_desc() stub if !GPIOLIB
  iwlwifi: move iwl_nvm_check_version() into dvm
  microblaze: move "... is ready" messages to arch/microblaze/Makefile
  microblaze: adjust the help to the real behavior
  ubi: Do not drop UBI device reference before using
  ubi: Put MTD device after it is not used
  xfs: require both realtime inodes to mount
  rtl818x: fix potential use after free
  mwifiex: debugfs: correct histogram spacing, formatting
  mwifiex: fix potential NULL dereference and use after free
  crypto: user - support incremental algorithm dumps
  scsi: lpfc: Enable Management features for IF_TYPE=6
  ACPI / LPSS: Ignore acpi_device_fix_up_power() return value
  ARM: ks8695: fix section mismatch warning
  PM / AVS: SmartReflex: NULL check before some freeing functions is not needed
  RDMA/vmw_pvrdma: Use atomic memory allocation in create AH
  ARM: OMAP1: fix USB configuration for device-only setups
  arm64: smp: Handle errors reported by the firmware
  arm64: mm: Prevent mismatched 52-bit VA support
  parisc: Fix HP SDC hpa address output
  parisc: Fix serio address output
  ARM: dts: imx53-voipac-dmm-668: Fix memory node duplication
  ARM: debug-imx: only define DEBUG_IMX_UART_PORT if needed
  ARM: dts: Fix up SQ201 flash access
  scsi: lpfc: Fix dif and first burst use in write commands
  scsi: lpfc: Fix kernel Oops due to null pring pointers
  pwm: bcm-iproc: Prevent unloading the driver module while in use
  block: drbd: remove a stray unlock in __drbd_send_protocol()
  mac80211: fix station inactive_time shortly after boot
  ceph: return -EINVAL if given fsc mount option on kernel w/o support
  net: bcmgenet: reapply manual settings to the PHY
  scripts/gdb: fix debugging modules compiled with hot/cold partitioning
  watchdog: meson: Fix the wrong value of left time
  can: rx-offload: can_rx_offload_irq_offload_fifo(): continue on error
  can: rx-offload: can_rx_offload_irq_offload_timestamp(): continue on error
  can: rx-offload: can_rx_offload_offload_one(): use ERR_PTR() to propagate error value in case of errors
  can: rx-offload: can_rx_offload_offload_one(): increment rx_fifo_errors on queue overflow or OOM
  can: rx-offload: can_rx_offload_offload_one(): do not increase the skb_queue beyond skb_queue_len_max
  can: rx-offload: can_rx_offload_queue_tail(): fix error handling, avoid skb mem leak
  can: c_can: D_CAN: c_can_chip_config(): perform a sofware reset on open
  can: peak_usb: report bus recovery as well
  bridge: ebtables: don't crash when using dnat target in output chains
  net: fec: add missed clk_disable_unprepare in remove
  clk: ti: dra7-atl-clock: Remove ti_clk_add_alias call
  x86/resctrl: Prevent NULL pointer dereference when reading mondata
  idr: Fix idr_alloc_u32 on 32-bit systems
  clk: sunxi-ng: a80: fix the zero'ing of bits 16 and 18
  clk: at91: avoid sleeping early
  reset: fix reset_control_ops kerneldoc comment
  clk: samsung: exynos5420: Preserve PLL configuration during suspend/resume
  ASoC: kirkwood: fix external clock probe defer
  reset: Fix memory leak in reset_control_array_put()
  ASoC: compress: fix unsigned integer overflow check
  ASoC: msm8916-wcd-analog: Fix RX1 selection in RDAC2 MUX
  clk: meson: gxbb: let sar_adc_clk_div set the parent clock rate
  Revert "KVM: nVMX: reset cache/shadows when switching loaded VMCS"
  UPSTREAM: dt-bindings: arm: coresight: Add support for coresight-loses-context-with-cpu
  BACKPORT: coresight: etm4x: Save/restore state across CPU low power states
  BACKPORT: ARM: 8900/1: UNWINDER_FRAME_POINTER implementation for Clang

Conflicts:
	Documentation/devicetree/bindings/arm/coresight.txt
	arch/arm/Makefile
	drivers/hid/hid-core.c
	kernel/exit.c

Reverted the downstream patch "HID: core: add usage_page_preceding flag for hid_concatenate_usage_page()"
as original issue got fixed with upstream changes.

Change-Id: I3b833825b3d1104fa07378caef144639074d0a0d
Signed-off-by: Srinivasarao P <spathi@codeaurora.org>
2020-04-16 16:59:09 +05:30
Wei Yang
1a439b577e vmscan: return NODE_RECLAIM_NOSCAN in node_reclaim() when CONFIG_NUMA is n
[ Upstream commit 8b09549c2bfd9f3f8f4cdad74107ef4f4ff9cdd7 ]

Commit fa5e084e43eb ("vmscan: do not unconditionally treat zones that
fail zone_reclaim() as full") changed the return value of
node_reclaim().  The original return value 0 means NODE_RECLAIM_SOME
after this commit.

While the return value of node_reclaim() when CONFIG_NUMA is n is not
changed.  This will leads to call zone_watermark_ok() again.

This patch fixes the return value by adjusting to NODE_RECLAIM_NOSCAN.
Since node_reclaim() is only called in page_alloc.c, move it to
mm/internal.h.

Link: http://lkml.kernel.org/r/20181113080436.22078-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-05 15:37:51 +01:00
Charan Teja Reddy
5cf45a00e3 mm: add a feedback from lowmemorykiller for OOM decisions
In the alloc slow path, for non-costly order allocations, we rely on the
should_reclaim_retry() function to decide whether we should go for page
reclaim before landing into OOM. But when low memory killer is enabled,
the wrong accounting of anon lru pages can result into unnecessary
retries that inturn lead to page allocation stalls for longer
durations. To avoid this situation, get input from the LMK that
has already the info about whether there are any lmk killable tasks.
This also fixes commit 23ba63a1c103 ("mm: retry more before OOM in the
presence of slow shrinkers")

Change-Id: I91416b7646d4545f7762402e50c6193771f62605
Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
2018-08-29 16:50:21 +05:30
Peter Zijlstra
ce37f07c6b mm: provide speculative fault infrastructure
Provide infrastructure to do a speculative fault (not holding
mmap_sem).

The not holding of mmap_sem means we can race against VMA
change/removal and page-table destruction. We use the SRCU VMA freeing
to keep the VMA around. We use the VMA seqcount to detect change
(including umapping / page-table deletion) and we use gup_fast() style
page-table walking to deal with page-table races.

Once we've obtained the page and are ready to update the PTE, we
validate if the state we started the fault with is still valid, if
not, we'll fail the fault with VM_FAULT_RETRY, otherwise we update the
PTE and we're done.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

[Manage the newly introduced pte_spinlock() for speculative page
 fault to fail if the VMA is touched in our back]
[Rename vma_is_dead() to vma_has_changed() and declare it here]
[Fetch p4d and pud]
[Set vmd.sequence in __handle_mm_fault()]
[Abort speculative path when handle_userfault() has to be called]
[Add additional VMA's flags checks in handle_speculative_fault()]
[Clear FAULT_FLAG_ALLOW_RETRY in handle_speculative_fault()]
[Don't set vmf->pte and vmf->ptl if pte_map_lock() failed]
[Remove warning comment about waiting for !seq&1 since we don't want
 to wait]
[Remove warning about no huge page support, mention it explictly]
[Don't call do_fault() in the speculative path as __do_fault() calls
 vma->vm_ops->fault() which may want to release mmap_sem]
[Only vm_fault pointer argument for vma_has_changed()]
[Fix check against huge page, calling pmd_trans_huge()]
[Use READ_ONCE() when reading VMA's fields in the speculative path]
[Explicitly check for __HAVE_ARCH_PTE_SPECIAL as we can't support for
 processing done in vm_normal_page()]
[Check that vma->anon_vma is already set when starting the speculative
 path]
[Check for memory policy as we can't support MPOL_INTERLEAVE case due to
 the processing done in mpol_misplaced()]
[Don't support VMA growing up or down]
[Move check on vm_sequence just before calling handle_pte_fault()]
[Don't build SPF services if !CONFIG_SPECULATIVE_PAGE_FAULT]
[Add mem cgroup oom check]
[Use READ_ONCE to access p*d entries]
[Replace deprecated ACCESS_ONCE() by READ_ONCE() in vma_has_changed()]
[Don't fetch pte again in handle_pte_fault() when running the speculative
 path]
[Check PMD against concurrent collapsing operation]
[Try spin lock the pte during the speculative path to avoid deadlock with
 other CPU's invalidating the TLB and requiring this CPU to catch the
 inter processor's interrupt]
[Move define of FAULT_FLAG_SPECULATIVE here]
[Introduce __handle_speculative_fault() and add a check against
 mm->mm_users in handle_speculative_fault() defined in mm.h]
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Change-Id: I45cbe79a96c7aa234cace421a36550e7fb0b9368
Patch-mainline: linux-mm @ Tue, 17 Apr 2018 16:33:24
[vinmenon@codeaurora.org: fix !CONFIG_NUMA build +
checkpatch fixes]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
2018-06-04 15:51:46 +05:30
Laurent Dufour
eba2eb1ba0 mm: protect mm_rb tree with a rwlock
This change is inspired by the Peter's proposal patch [1] which was
protecting the VMA using SRCU. Unfortunately, SRCU is not scaling well in
that particular case, and it is introducing major performance degradation
due to excessive scheduling operations.

To allow access to the mm_rb tree without grabbing the mmap_sem, this patch
is protecting it access using a rwlock.  As the mm_rb tree is a O(log n)
search it is safe to protect it using such a lock.  The VMA cache is not
protected by the new rwlock and it should not be used without holding the
mmap_sem.

To allow the picked VMA structure to be used once the rwlock is released, a
use count is added to the VMA structure. When the VMA is allocated it is
set to 1.  Each time the VMA is picked with the rwlock held its use count
is incremented. Each time the VMA is released it is decremented. When the
use count hits zero, this means that the VMA is no more used and should be
freed.

This patch is preparing for 2 kind of VMA access :
 - as usual, under the control of the mmap_sem,
 - without holding the mmap_sem for the speculative page fault handler.

Access done under the control the mmap_sem doesn't require to grab the
rwlock to protect read access to the mm_rb tree, but access in write must
be done under the protection of the rwlock too. This affects inserting and
removing of elements in the RB tree.

The patch is introducing 2 new functions:
 - vma_get() to find a VMA based on an address by holding the new rwlock.
 - vma_put() to release the VMA when its no more used.
These services are designed to be used when access are made to the RB tree
without holding the mmap_sem.

When a VMA is removed from the RB tree, its vma->vm_rb field is cleared and
we rely on the WMB done when releasing the rwlock to serialize the write
with the RMB done in a later patch to check for the VMA's validity.

When free_vma is called, the file associated with the VMA is closed
immediately, but the policy and the file structure remained in used until
the VMA's use count reach 0, which may happens later when exiting an
in progress speculative page fault.

[1] https://patchwork.kernel.org/patch/5108281/

Change-Id: I9ecc922b8efa4b28975cc6a8e9531284c24ac14e
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Patch-mainline: linux-mm @ Tue, 17 Apr 2018 16:33:23
[vinmenon@codeaurora.org: fix the return of put_vma]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
2018-06-04 15:51:45 +05:30
Michal Hocko
cd04ae1e2d mm, oom: do not rely on TIF_MEMDIE for memory reserves access
For ages we have been relying on TIF_MEMDIE thread flag to mark OOM
victims and then, among other things, to give these threads full access
to memory reserves.  There are few shortcomings of this implementation,
though.

First of all and the most serious one is that the full access to memory
reserves is quite dangerous because we leave no safety room for the
system to operate and potentially do last emergency steps to move on.

Secondly this flag is per task_struct while the OOM killer operates on
mm_struct granularity so all processes sharing the given mm are killed.
Giving the full access to all these task_structs could lead to a quick
memory reserves depletion.  We have tried to reduce this risk by giving
TIF_MEMDIE only to the main thread and the currently allocating task but
that doesn't really solve this problem while it surely opens up a room
for corner cases - e.g.  GFP_NO{FS,IO} requests might loop inside the
allocator without access to memory reserves because a particular thread
was not the group leader.

Now that we have the oom reaper and that all oom victims are reapable
after 1b51e65eab64 ("oom, oom_reaper: allow to reap mm shared by the
kthreads") we can be more conservative and grant only partial access to
memory reserves because there are reasonable chances of the parallel
memory freeing.  We still want some access to reserves because we do not
want other consumers to eat up the victim's freed memory.  oom victims
will still contend with __GFP_HIGH users but those shouldn't be so
aggressive to starve oom victims completely.

Introduce ALLOC_OOM flag and give all tsk_is_oom_victim tasks access to
the half of the reserves.  This makes the access to reserves independent
on which task has passed through mark_oom_victim.  Also drop any usage
of TIF_MEMDIE from the page allocator proper and replace it by
tsk_is_oom_victim as well which will make page_alloc.c completely
TIF_MEMDIE free finally.

CONFIG_MMU=n doesn't have oom reaper so let's stick to the original
ALLOC_NO_WATERMARKS approach.

There is a demand to make the oom killer memcg aware which will imply
many tasks killed at once.  This change will allow such a usecase
without worrying about complete memory reserves depletion.

Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:30 -07:00
Michal Hocko
72675e131e mm, memory_hotplug: drop zone from build_all_zonelists
build_all_zonelists gets a zone parameter to initialize zone's pagesets.
There is only a single user which gives a non-NULL zone parameter and
that one doesn't really need the rest of the build_all_zonelists (see
commit 6dcd73d7011b ("memory-hotplug: allocate zone's pcp before
onlining pages")).

Therefore remove setup_zone_pageset from build_all_zonelists and call it
from its only user directly.  This will also remove a pointless zonlists
rebuilding which is always good.

Link: http://lkml.kernel.org/r/20170721143915.14161-5-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:25 -07:00
Mel Gorman
3ea277194d mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries
Nadav Amit identified a theoritical race between page reclaim and
mprotect due to TLB flushes being batched outside of the PTL being held.

He described the race as follows:

        CPU0                            CPU1
        ----                            ----
                                        user accesses memory using RW PTE
                                        [PTE now cached in TLB]
        try_to_unmap_one()
        ==> ptep_get_and_clear()
        ==> set_tlb_ubc_flush_pending()
                                        mprotect(addr, PROT_READ)
                                        ==> change_pte_range()
                                        ==> [ PTE non-present - no flush ]

                                        user writes using cached RW PTE
        ...

        try_to_unmap_flush()

The same type of race exists for reads when protecting for PROT_NONE and
also exists for operations that can leave an old TLB entry behind such
as munmap, mremap and madvise.

For some operations like mprotect, it's not necessarily a data integrity
issue but it is a correctness issue as there is a window where an
mprotect that limits access still allows access.  For munmap, it's
potentially a data integrity issue although the race is massive as an
munmap, mmap and return to userspace must all complete between the
window when reclaim drops the PTL and flushes the TLB.  However, it's
theoritically possible so handle this issue by flushing the mm if
reclaim is potentially currently batching TLB flushes.

Other instances where a flush is required for a present pte should be ok
as either the page lock is held preventing parallel reclaim or a page
reference count is elevated preventing a parallel free leading to
corruption.  In the case of page_mkclean there isn't an obvious path
that userspace could take advantage of without using the operations that
are guarded by this patch.  Other users such as gup as a race with
reclaim looks just at PTEs.  huge page variants should be ok as they
don't race with reclaim.  mincore only looks at PTEs.  userfault also
should be ok as if a parallel reclaim takes place, it will either fault
the page back in or read some of the data before the flush occurs
triggering a fault.

Note that a variant of this patch was acked by Andy Lutomirski but this
was for the x86 parts on top of his PCID work which didn't make the 4.13
merge window as expected.  His ack is dropped from this version and
there will be a follow-on patch on top of PCID that will include his
ack.

[akpm@linux-foundation.org: tweak comments]
[akpm@linux-foundation.org: fix spello]
Link: http://lkml.kernel.org/r/20170717155523.emckq2esjro6hf3z@suse.de
Reported-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: <stable@vger.kernel.org>	[v4.4+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-02 16:34:46 -07:00
Michal Hocko
dcda9b0471 mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic
__GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to
the page allocator.  This has been true but only for allocations
requests larger than PAGE_ALLOC_COSTLY_ORDER.  It has been always
ignored for smaller sizes.  This is a bit unfortunate because there is
no way to express the same semantic for those requests and they are
considered too important to fail so they might end up looping in the
page allocator for ever, similarly to GFP_NOFAIL requests.

Now that the whole tree has been cleaned up and accidental or misled
usage of __GFP_REPEAT flag has been removed for !costly requests we can
give the original flag a better name and more importantly a more useful
semantic.  Let's rename it to __GFP_RETRY_MAYFAIL which tells the user
that the allocator would try really hard but there is no promise of a
success.  This will work independent of the order and overrides the
default allocator behavior.  Page allocator users have several levels of
guarantee vs.  cost options (take GFP_KERNEL as an example)

 - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
   attempt to free memory at all. The most light weight mode which even
   doesn't kick the background reclaim. Should be used carefully because
   it might deplete the memory and the next user might hit the more
   aggressive reclaim

 - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic
   allocation without any attempt to free memory from the current
   context but can wake kswapd to reclaim memory if the zone is below
   the low watermark. Can be used from either atomic contexts or when
   the request is a performance optimization and there is another
   fallback for a slow path.

 - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
   non sleeping allocation with an expensive fallback so it can access
   some portion of memory reserves. Usually used from interrupt/bh
   context with an expensive slow path fallback.

 - GFP_KERNEL - both background and direct reclaim are allowed and the
   _default_ page allocator behavior is used. That means that !costly
   allocation requests are basically nofail but there is no guarantee of
   that behavior so failures have to be checked properly by callers
   (e.g. OOM killer victim is allowed to fail currently).

 - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
   and all allocation requests fail early rather than cause disruptive
   reclaim (one round of reclaim in this implementation). The OOM killer
   is not invoked.

 - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
   behavior and all allocation requests try really hard. The request
   will fail if the reclaim cannot make any progress. The OOM killer
   won't be triggered.

 - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
   and all allocation requests will loop endlessly until they succeed.
   This might be really dangerous especially for larger orders.

Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
because they already had their semantic.  No new users are added.
__alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
there is no progress and we have already passed the OOM point.

This means that all the reclaim opportunities have been exhausted except
the most disruptive one (the OOM killer) and a user defined fallback
behavior is more sensible than keep retrying in the page allocator.

[akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
[mhocko@suse.com: semantic fix]
  Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
[mhocko@kernel.org: address other thing spotted by Vlastimil]
  Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alex Belits <alex.belits@cavium.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: David Daney <david.daney@cavium.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: NeilBrown <neilb@suse.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:03 -07:00
Vlastimil Babka
baf6a9a1db mm, compaction: finish whole pageblock to reduce fragmentation
The main goal of direct compaction is to form a high-order page for
allocation, but it should also help against long-term fragmentation when
possible.

Most lower-than-pageblock-order compactions are for non-movable
allocations, which means that if we compact in a movable pageblock and
terminate as soon as we create the high-order page, it's unlikely that
the fallback heuristics will claim the whole block.  Instead there might
be a single unmovable page in a pageblock full of movable pages, and the
next unmovable allocation might pick another pageblock and increase
long-term fragmentation.

To help against such scenarios, this patch changes the termination
criteria for compaction so that the current pageblock is finished even
though the high-order page already exists.  Note that it might be
possible that the high-order page formed elsewhere in the zone due to
parallel activity, but this patch doesn't try to detect that.

This is only done with sync compaction, because async compaction is
limited to pageblock of the same migratetype, where it cannot result in
a migratetype fallback.  (Async compaction also eagerly skips
order-aligned blocks where isolation fails, which is against the goal of
migrating away as much of the pageblock as possible.)

As a result of this patch, long-term memory fragmentation should be
reduced.

In testing based on 4.9 kernel with stress-highalloc from mmtests
configured for order-4 GFP_KERNEL allocations, this patch has reduced
the number of unmovable allocations falling back to movable pageblocks
by 20%.  The number

Link: http://lkml.kernel.org/r/20170307131545.28577-9-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:10 -07:00
Vlastimil Babka
d39773a062 mm, compaction: add migratetype to compact_control
Preparation patch.  We are going to need migratetype at lower layers
than compact_zone() and compact_finished().

Link: http://lkml.kernel.org/r/20170307131545.28577-7-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:10 -07:00
Vlastimil Babka
f25ba6dccc mm, compaction: reorder fields in struct compact_control
Patch series "try to reduce fragmenting fallbacks", v3.

Last year, Johannes Weiner has reported a regression in page mobility
grouping [1] and while the exact cause was not found, I've come up with
some ways to improve it by reducing the number of allocations falling
back to different migratetype and causing permanent fragmentation.

The series was tested with mmtests stress-highalloc modified to do
GFP_KERNEL order-4 allocations, on 4.9 with "mm, vmscan: fix zone
balance check in prepare_kswapd_sleep" (without that, kcompactd indeed
wasn't woken up) on UMA machine with 4GB memory.  There were 5 repeats
of each run, as the extfrag stats are quite volatile (note the stats
below are sums, not averages, as it was less perl hacking for me).

Success rate are the same, already high due to the low allocation order
used, so I'm not including them.

Compaction stats:
(the patches are stacked, and I haven't measured the non-functional-changes
patches separately)

                                     patch 1     patch 2     patch 3     patch 4     patch 7     patch 8
  Compaction stalls                    22449       24680       24846       19765       22059       17480
  Compaction success                   12971       14836       14608       10475       11632        8757
  Compaction failures                   9477        9843       10238        9290       10426        8722
  Page migrate success               3109022     3370438     3312164     1695105     1608435     2111379
  Page migrate failure                911588     1149065     1028264     1112675     1077251     1026367
  Compaction pages isolated          7242983     8015530     7782467     4629063     4402787     5377665
  Compaction migrate scanned       980838938   987367943   957690188   917647238   947155598  1018922197
  Compaction free scanned          557926893   598946443   602236894   594024490   541169699   763651731
  Compaction cost                      10243       10578       10304        8286        8398        9440

Compaction stats are mostly within noise until patch 4, which decreases
the number of compactions, and migrations.  Part of that could be due to
more pageblocks marked as unmovable, and async compaction skipping
those.  This changes a bit with patch 7, but not so much.  Patch 8
increases free scanner stats and migrations, which comes from the
changed termination criteria.  Interestingly number of compactions
decreases - probably the fully compacted pageblock satisfies multiple
subsequent allocations, so it amortizes.

Next comes the extfrag tracepoint, where "fragmenting" means that an
allocation had to fallback to a pageblock of another migratetype which
wasn't fully free (which is almost all of the fallbacks).  I have
locally added another tracepoint for "Page steal" into
steal_suitable_fallback() which triggers in situations where we are
allowed to do move_freepages_block().  If we decide to also do
set_pageblock_migratetype(), it's "Pages steal with pageblock" with
break down for which allocation migratetype we are stealing and from
which fallback migratetype.  The last part "due to counting" comes from
patch 4 and counts the events where the counting of movable pages
allowed us to change pageblock's migratetype, while the number of free
pages alone wouldn't be enough to cross the threshold.

                                                       patch 1     patch 2     patch 3     patch 4     patch 7     patch 8
  Page alloc extfrag event                            10155066     8522968    10164959    15622080    13727068    13140319
  Extfrag fragmenting                                 10149231     8517025    10159040    15616925    13721391    13134792
  Extfrag fragmenting for unmovable                     159504      168500      184177       97835       70625       56948
  Extfrag fragmenting unmovable placed with movable     153613      163549      172693       91740       64099       50917
  Extfrag fragmenting unmovable placed with reclaim.      5891        4951       11484        6095        6526        6031
  Extfrag fragmenting for reclaimable                     4738        4829        6345        4822        5640        5378
  Extfrag fragmenting reclaimable placed with movable     1836        1902        1851        1579        1739        1760
  Extfrag fragmenting reclaimable placed with unmov.      2902        2927        4494        3243        3901        3618
  Extfrag fragmenting for movable                      9984989     8343696     9968518    15514268    13645126    13072466
  Pages steal                                           179954      192291      210880      123254       94545       81486
  Pages steal with pageblock                             22153       18943       20154       33562       29969       33444
  Pages steal with pageblock for unmovable               14350       12858       13256       20660       19003       20852
  Pages steal with pageblock for unmovable from mov.     12812       11402       11683       19072       17467       19298
  Pages steal with pageblock for unmovable from recl.     1538        1456        1573        1588        1536        1554
  Pages steal with pageblock for movable                  7114        5489        5965       11787       10012       11493
  Pages steal with pageblock for movable from unmov.      6885        5291        5541       11179        9525       10885
  Pages steal with pageblock for movable from recl.        229         198         424         608         487         608
  Pages steal with pageblock for reclaimable               689         596         933        1115         954        1099
  Pages steal with pageblock for reclaimable from unmov.   273         219         537         658         547         667
  Pages steal with pageblock for reclaimable from mov.     416         377         396         457         407         432
  Pages steal with pageblock due to counting                                                 11834       10075        7530
  ... for unmovable                                                                           8993        7381        4616
  ... for movable                                                                             2792        2653        2851
  ... for reclaimable                                                                           49          41          63

What we can see is that "Extfrag fragmenting for unmovable" and "...
placed with movable" drops with almost each patch, which is good as we
are polluting less movable pageblocks with unmovable pages.

The most significant change is patch 4 with movable page counting.  On
the other hand it increases "Extfrag fragmenting for movable" by 50%.
"Pages steal" drops though, so these movable allocation fallbacks find
only small free pages and are not allowed to steal whole pageblocks
back.  "Pages steal with pageblock" raises, because the patch increases
the chances of pageblock migratetype changes to happen.  This affects
all migratetypes.

The summary is that patch 4 is not a clear win wrt these stats, but I
believe that the tradeoff it makes is a good one.  There's less
pollution of movable pageblocks by unmovable allocations.  There's less
stealing between pageblock, and those that remain have higher chance of
changing migratetype also the pageblock itself, so it should more
faithfully reflect the migratetype of the pages within the pageblock.
The increase of movable allocations falling back to unmovable pageblock
might look dramatic, but those allocations can be migrated by compaction
when needed, and other patches in the series (7-9) improve that aspect.

Patches 7 and 8 continue the trend of reduced unmovable fallbacks and
also reduce the impact on movable fallbacks from patch 4.

[1] https://www.spinics.net/lists/linux-mm/msg114237.html

This patch (of 8):

While currently there are (mostly by accident) no holes in struct
compact_control (on x86_64), but we are going to add more bool flags, so
place them all together to the end of the structure.  While at it, just
order all fields from largest to smallest.

Link: http://lkml.kernel.org/r/20170307131545.28577-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:09 -07:00
Xishi Qiu
a6ffdc0784 mm: use is_migrate_highatomic() to simplify the code
Introduce two helpers, is_migrate_highatomic() and is_migrate_highatomic_page().

Simplify the code, no functional changes.

[akpm@linux-foundation.org: use static inlines rather than macros, per mhocko]
Link: http://lkml.kernel.org/r/58B94F15.6060606@huawei.com
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03 15:52:08 -07:00
Johannes Weiner
c822f6223d mm: delete NR_PAGES_SCANNED and pgdat_reclaimable()
NR_PAGES_SCANNED counts number of pages scanned since the last page free
event in the allocator.  This was used primarily to measure the
reclaimability of zones and nodes, and determine when reclaim should
give up on them.  In that role, it has been replaced in the preceding
patches by a different mechanism.

Being implemented as an efficient vmstat counter, it was automatically
exported to userspace as well.  It's however unlikely that anyone
outside the kernel is using this counter in any meaningful way.

Remove the counter and the unused pgdat_reclaimable().

Link: http://lkml.kernel.org/r/20170228214007.5621-8-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Jia He <hejianet@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03 15:52:08 -07:00
Johannes Weiner
c73322d098 mm: fix 100% CPU kswapd busyloop on unreclaimable nodes
Patch series "mm: kswapd spinning on unreclaimable nodes - fixes and
cleanups".

Jia reported a scenario in which the kswapd of a node indefinitely spins
at 100% CPU usage.  We have seen similar cases at Facebook.

The kernel's current method of judging its ability to reclaim a node (or
whether to back off and sleep) is based on the amount of scanned pages
in proportion to the amount of reclaimable pages.  In Jia's and our
scenarios, there are no reclaimable pages in the node, however, and the
condition for backing off is never met.  Kswapd busyloops in an attempt
to restore the watermarks while having nothing to work with.

This series reworks the definition of an unreclaimable node based not on
scanning but on whether kswapd is able to actually reclaim pages in
MAX_RECLAIM_RETRIES (16) consecutive runs.  This is the same criteria
the page allocator uses for giving up on direct reclaim and invoking the
OOM killer.  If it cannot free any pages, kswapd will go to sleep and
leave further attempts to direct reclaim invocations, which will either
make progress and re-enable kswapd, or invoke the OOM killer.

Patch #1 fixes the immediate problem Jia reported, the remainder are
smaller fixlets, cleanups, and overall phasing out of the old method.

Patch #6 is the odd one out.  It's a nice cleanup to get_scan_count(),
and directly related to #5, but in itself not relevant to the series.

If the whole series is too ambitious for 4.11, I would consider the
first three patches fixes, the rest cleanups.

This patch (of 9):

Jia He reports a problem with kswapd spinning at 100% CPU when
requesting more hugepages than memory available in the system:

$ echo 4000 >/proc/sys/vm/nr_hugepages

top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3

At that time, there are no reclaimable pages left in the node, but as
kswapd fails to restore the high watermarks it refuses to go to sleep.

Kswapd needs to back away from nodes that fail to balance.  Up until
commit 1d82de618ddd ("mm, vmscan: make kswapd reclaim in terms of
nodes") kswapd had such a mechanism.  It considered zones whose
theoretically reclaimable pages it had reclaimed six times over as
unreclaimable and backed away from them.  This guard was erroneously
removed as the patch changed the definition of a balanced node.

However, simply restoring this code wouldn't help in the case reported
here: there *are* no reclaimable pages that could be scanned until the
threshold is met.  Kswapd would stay awake anyway.

Introduce a new and much simpler way of backing off.  If kswapd runs
through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
page, make it back off from the node.  This is the same number of shots
direct reclaim takes before declaring OOM.  Kswapd will go to sleep on
that node until a direct reclaimer manages to reclaim some pages, thus
proving the node reclaimable again.

[hannes@cmpxchg.org: check kswapd failure against the cumulative nr_reclaimed count]
  Link: http://lkml.kernel.org/r/20170306162410.GB2090@cmpxchg.org
[shakeelb@google.com: fix condition for throttle_direct_reclaim]
  Link: http://lkml.kernel.org/r/20170314183228.20152-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20170228214007.5621-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reported-by: Jia He <hejianet@gmail.com>
Tested-by: Jia He <hejianet@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03 15:52:07 -07:00
Michal Hocko
ce612879dd mm: move pcp and lru-pcp draining into single wq
We currently have 2 specific WQ_RECLAIM workqueues in the mm code.
vmstat_wq for updating pcp stats and lru_add_drain_wq dedicated to drain
per cpu lru caches.  This seems more than necessary because both can run
on a single WQ.  Both do not block on locks requiring a memory
allocation nor perform any allocations themselves.  We will save one
rescuer thread this way.

On the other hand drain_all_pages() queues work on the system wq which
doesn't have rescuer and so this depend on memory allocation (when all
workers are stuck allocating and new ones cannot be created).

Initially we thought this would be more of a theoretical problem but
Hugh Dickins has reported:

: 4.11-rc has been giving me hangs after hours of swapping load.  At
: first they looked like memory leaks ("fork: Cannot allocate memory");
: but for no good reason I happened to do "cat /proc/sys/vm/stat_refresh"
: before looking at /proc/meminfo one time, and the stat_refresh stuck
: in D state, waiting for completion of flush_work like many kworkers.
: kthreadd waiting for completion of flush_work in drain_all_pages().

This worker should be using WQ_RECLAIM as well in order to guarantee a
forward progress.  We can reuse the same one as for lru draining and
vmstat.

Link: http://lkml.kernel.org/r/20170307131751.24936-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Tested-by: Yang Li <pku.leo@gmail.com>
Tested-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-04-08 00:47:49 -07:00
Kirill A. Shutemov
a8fa41ad2f mm, rmap: check all VMAs that PTE-mapped THP can be part of
Current rmap code can miss a VMA that maps PTE-mapped THP if the first
suppage of the THP was unmapped from the VMA.

We need to walk rmap for the whole range of offsets that THP covers, not
only the first one.

vma_address() also need to be corrected to check the range instead of
the first subpage.

Link: http://lkml.kernel.org/r/20170129173858.45174-6-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24 17:46:55 -08:00
Kirill A. Shutemov
235190738a oom-reaper: use madvise_dontneed() logic to decide if unmap the VMA
Logic on whether we can reap pages from the VMA should match what we
have in madvise_dontneed().  In particular, we should skip, VM_PFNMAP
VMAs, but we don't now.

Let's just extract condition on which we can shoot down pagesi from a
VMA with MADV_DONTNEED into separate function and use it in both places.

Link: http://lkml.kernel.org/r/20170118122429.43661-4-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22 16:41:30 -08:00
David Rientjes
7f354a548d mm, compaction: add vmstats for kcompactd work
A "compact_daemon_wake" vmstat exists that represents the number of
times kcompactd has woken up.  This doesn't represent how much work it
actually did, though.

It's useful to understand how much compaction work is being done by
kcompactd versus other methods such as direct compaction and explicitly
triggered per-node (or system) compaction.

This adds two new vmstats: "compact_daemon_migrate_scanned" and
"compact_daemon_free_scanned" to represent the number of pages kcompactd
has scanned as part of its migration scanner and freeing scanner,
respectively.

These values are still accounted for in the general
"compact_migrate_scanned" and "compact_free_scanned" for compatibility.

It could be argued that explicitly triggered compaction could also be
tracked separately, and that could be added if others find it useful.

Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1612071749390.69852@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22 16:41:29 -08:00
Vlastimil Babka
76741e776a mm, page_alloc: don't convert pfn to idx when merging
In __free_one_page() we do the buddy merging arithmetics on "page/buddy
index", which is just the lower MAX_ORDER bits of pfn.  The operations
we do that affect the higher bits are bitwise AND and subtraction (in
that order), where the final result will be the same with the higher
bits left unmasked, as long as these bits are equal for both buddies -
which must be true by the definition of a buddy.

We can therefore use pfn's directly instead of "index" and skip the
zeroing of >MAX_ORDER bits.  This can help a bit by itself, although
compiler might be smart enough already.  It also helps the next patch to
avoid page_to_pfn() for memory hole checks.

Link: http://lkml.kernel.org/r/20161216120009.20064-1-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22 16:41:27 -08:00
Nicholas Piggin
6290602709 mm: add PageWaiters indicating tasks are waiting for a page bit
Add a new page flag, PageWaiters, to indicate the page waitqueue has
tasks waiting. This can be tested rather than testing waitqueue_active
which requires another cacheline load.

This bit is always set when the page has tasks on page_waitqueue(page),
and is set and cleared under the waitqueue lock. It may be set when
there are no tasks on the waitqueue, which will cause a harmless extra
wakeup check that will clears the bit.

The generic bit-waitqueue infrastructure is no longer used for pages.
Instead, waitqueues are used directly with a custom key type. The
generic code was not flexible enough to have PageWaiters manipulation
under the waitqueue lock (which simplifies concurrency).

This improves the performance of page lock intensive microbenchmarks by
2-3%.

Putting two bits in the same word opens the opportunity to remove the
memory barrier between clearing the lock bit and testing the waiters
bit, after some work on the arch primitives (e.g., ensuring memory
operand widths match and cover both bits).

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Andrew Lutomirski <luto@kernel.org>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-25 11:54:48 -08:00
Jan Kara
2994302bc8 mm: add orig_pte field into vm_fault
Add orig_pte field to vm_fault structure to allow ->page_mkwrite
handlers to fully handle the fault.

This also allows us to save some passing of extra arguments around.

Link: http://lkml.kernel.org/r/1479460644-25076-8-git-send-email-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14 16:04:09 -08:00
Jan Kara
82b0f8c39a mm: join struct fault_env and vm_fault
Currently we have two different structures for passing fault information
around - struct vm_fault and struct fault_env.  DAX will need more
information in struct vm_fault to handle its faults so the content of
that structure would become event closer to fault_env.  Furthermore it
would need to generate struct fault_env to be able to call some of the
generic functions.  So at this point I don't think there's much use in
keeping these two structures separate.  Just embed into struct vm_fault
all that is needed to use it for both purposes.

Link: http://lkml.kernel.org/r/1479460644-25076-2-git-send-email-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14 16:04:09 -08:00
Vlastimil Babka
9f7e338793 mm, compaction: make full priority ignore pageblock suitability
Several people have reported premature OOMs for order-2 allocations
(stack) due to OOM rework in 4.7.  In the scenario (parallel kernel
build and dd writing to two drives) many pageblocks get marked as
Unmovable and compaction free scanner struggles to isolate free pages.
Joonsoo Kim pointed out that the free scanner skips pageblocks that are
not movable to prevent filling them and forcing non-movable allocations
to fallback to other pageblocks.  Such heuristic makes sense to help
prevent long-term fragmentation, but premature OOMs are relatively more
urgent problem.  As a compromise, this patch disables the heuristic only
for the ultimate compaction priority.

Link: http://lkml.kernel.org/r/20160906135258.18335-5-vbabka@suse.cz
Reported-by: Ralf-Peter Rohbeck <Ralf-Peter.Rohbeck@quantum.com>
Reported-by: Arkadiusz Miskiewicz <a.miskiewicz@gmail.com>
Reported-by: Olaf Hering <olaf@aepfle.de>
Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:29 -07:00
Vlastimil Babka
06ed29989f mm, compaction: make whole_zone flag ignore cached scanner positions
Patch series "make direct compaction more deterministic")

This is mostly a followup to Michal's oom detection rework, which
highlighted the need for direct compaction to provide better feedback in
reclaim/compaction loop, so that it can reliably recognize when
compaction cannot make further progress, and allocation should invoke
OOM killer or fail.  We've discussed this at LSF/MM [1] where I proposed
expanding the async/sync migration mode used in compaction to more
general "priorities".  This patchset adds one new priority that just
overrides all the heuristics and makes compaction fully scan all zones.
I don't currently think that we need more fine-grained priorities, but
we'll see.  Other than that there's some smaller fixes and cleanups,
mainly related to the THP-specific hacks.

I've tested this with stress-highalloc in GFP_KERNEL order-4 and
THP-like order-9 scenarios.  There's some improvement for compaction
stats for the order-4, which is likely due to the better watermarks
handling.  In the previous version I reported mostly noise wrt
compaction stats, and decreased direct reclaim - now the reclaim is
without difference.  I believe this is due to the less aggressive
compaction priority increase in patch 6.

"before" is a mmotm tree prior to 4.7 release plus the first part of the
series that was sent and merged separately

                                    before        after
order-4:

Compaction stalls                    27216       30759
Compaction success                   19598       25475
Compaction failures                   7617        5283
Page migrate success                370510      464919
Page migrate failure                 25712       27987
Compaction pages isolated           849601     1041581
Compaction migrate scanned       143146541   101084990
Compaction free scanned          208355124   144863510
Compaction cost                       1403        1210

order-9:

Compaction stalls                     7311        7401
Compaction success                    1634        1683
Compaction failures                   5677        5718
Page migrate success                194657      183988
Page migrate failure                  4753        4170
Compaction pages isolated           498790      456130
Compaction migrate scanned          565371      524174
Compaction free scanned            4230296     4250744
Compaction cost                        215         203

[1] https://lwn.net/Articles/684611/

This patch (of 11):

A recent patch has added whole_zone flag that compaction sets when
scanning starts from the zone boundary, in order to report that zone has
been fully scanned in one attempt.  For allocations that want to try
really hard or cannot fail, we will want to introduce a mode where
scanning whole zone is guaranteed regardless of the cached positions.

This patch reuses the whole_zone flag in a way that if it's already
passed true to compaction, the cached scanner positions are ignored.
Employing this flag during reclaim/compaction loop will be done in the
next patch.  This patch however converts compaction invoked from
userspace via procfs to use this flag.  Before this patch, the cached
positions were first reset to zone boundaries and then read back from
struct zone, so there was a window where a parallel compaction could
replace the reset values, making the manual compaction less effective.
Using the flag instead of performing reset is more robust.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20160810091226.6709-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:27 -07:00
Vlastimil Babka
c3486f5376 mm, compaction: simplify contended compaction handling
Async compaction detects contention either due to failing trylock on
zone->lock or lru_lock, or by need_resched().  Since 1f9efdef4f3f ("mm,
compaction: khugepaged should not give up due to need_resched()") the
code got quite complicated to distinguish these two up to the
__alloc_pages_slowpath() level, so different decisions could be taken
for khugepaged allocations.

After the recent changes, khugepaged allocations don't check for
contended compaction anymore, so we again don't need to distinguish lock
and sched contention, and simplify the current convoluted code a lot.

However, I believe it's also possible to simplify even more and
completely remove the check for contended compaction after the initial
async compaction for costly orders, which was originally aimed at THP
page fault allocations.  There are several reasons why this can be done
now:

- with the new defaults, THP page faults no longer do reclaim/compaction at
  all, unless the system admin has overridden the default, or application has
  indicated via madvise that it can benefit from THP's. In both cases, it
  means that the potential extra latency is expected and worth the benefits.
- even if reclaim/compaction proceeds after this patch where it previously
  wouldn't, the second compaction attempt is still async and will detect the
  contention and back off, if the contention persists
- there are still heuristics like deferred compaction and pageblock skip bits
  in place that prevent excessive THP page fault latencies

Link: http://lkml.kernel.org/r/20160721073614.24395-9-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman
e6cbd7f2ef mm, page_alloc: remove fair zone allocation policy
The fair zone allocation policy interleaves allocation requests between
zones to avoid an age inversion problem whereby new pages are reclaimed
to balance a zone.  Reclaim is now node-based so this should no longer
be an issue and the fair zone allocation policy is not free.  This patch
removes it.

Link: http://lkml.kernel.org/r/1467970510-21195-30-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman
a5f5f91da6 mm: convert zone_reclaim to node_reclaim
As reclaim is now per-node based, convert zone_reclaim to be
node_reclaim.  It is possible that a node will be reclaimed multiple
times if it has multiple zones but this is unavoidable without caching
all nodes traversed so far.  The documentation and interface to
userspace is the same from a configuration perspective and will will be
similar in behaviour unless the node-local allocation requests were also
limited to lower zones.

Link: http://lkml.kernel.org/r/1467970510-21195-24-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman
599d0c954f mm, vmscan: move LRU lists to node
This moves the LRU lists from the zone to the node and related data such
as counters, tracing, congestion tracking and writeback tracking.

Unfortunately, due to reclaim and compaction retry logic, it is
necessary to account for the number of LRU pages on both zone and node
logic.  Most reclaim logic is based on the node counters but the retry
logic uses the zone counters which do not distinguish inactive and
active sizes.  It would be possible to leave the LRU counters on a
per-zone basis but it's a heavier calculation across multiple cache
lines that is much more frequent than the retry checks.

Other than the LRU counters, this is mostly a mechanical patch but note
that it introduces a number of anomalies.  For example, the scans are
per-zone but using per-node counters.  We also mark a node as congested
when a zone is congested.  This causes weird problems that are fixed
later but is easier to review.

In the event that there is excessive overhead on 32-bit systems due to
the nodes being on LRU then there are two potential solutions

1. Long-term isolation of highmem pages when reclaim is lowmem

   When pages are skipped, they are immediately added back onto the LRU
   list. If lowmem reclaim persisted for long periods of time, the same
   highmem pages get continually scanned. The idea would be that lowmem
   keeps those pages on a separate list until a reclaim for highmem pages
   arrives that splices the highmem pages back onto the LRU. It potentially
   could be implemented similar to the UNEVICTABLE list.

   That would reduce the skip rate with the potential corner case is that
   highmem pages have to be scanned and reclaimed to free lowmem slab pages.

2. Linear scan lowmem pages if the initial LRU shrink fails

   This will break LRU ordering but may be preferable and faster during
   memory pressure than skipping LRU pages.

Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Kirill A. Shutemov
bae473a423 mm: introduce fault_env
The idea borrowed from Peter's patch from patchset on speculative page
faults[1]:

Instead of passing around the endless list of function arguments,
replace the lot with a single structure so we can change context without
endless function signature changes.

The changes are mostly mechanical with exception of faultaround code:
filemap_map_pages() got reworked a bit.

This patch is preparation for the next one.

[1] http://lkml.kernel.org/r/20141020222841.302891540@infradead.org

Link: http://lkml.kernel.org/r/1466021202-61880-9-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Ebru Akagunduz
8a966ed746 mm: make swapin readahead to improve thp collapse rate
This patch makes swapin readahead to improve thp collapse rate.  When
khugepaged scanned pages, there can be a few of the pages in swap area.

With the patch THP can collapse 4kB pages into a THP when there are up
to max_ptes_swap swap ptes in a 2MB range.

The patch was tested with a test program that allocates 400B of memory,
writes to it, and then sleeps.  I force the system to swap out all.
Afterwards, the test program touches the area by writing, it skips a
page in each 20 pages of the area.

Without the patch, system did not swap in readahead.  THP rate was %65
of the program of the memory, it did not change over time.

With this patch, after 10 minutes of waiting khugepaged had collapsed
%99 of the program's memory.

[kirill.shutemov@linux.intel.com: trivial cleanup of exit path of the function]
[kirill.shutemov@linux.intel.com: __collapse_huge_page_swapin(): drop unused 'pte' parameter]
[kirill.shutemov@linux.intel.com: do not hold anon_vma lock during swap in]
Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Xie XiuQi <xiexiuqi@huawei.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Joonsoo Kim
46f24fd857 mm/page_alloc: introduce post allocation processing on page allocator
This patch is motivated from Hugh and Vlastimil's concern [1].

There are two ways to get freepage from the allocator.  One is using
normal memory allocation API and the other is __isolate_free_page()
which is internally used for compaction and pageblock isolation.  Later
usage is rather tricky since it doesn't do whole post allocation
processing done by normal API.

One problematic thing I already know is that poisoned page would not be
checked if it is allocated by __isolate_free_page().  Perhaps, there
would be more.

We could add more debug logic for allocated page in the future and this
separation would cause more problem.  I'd like to fix this situation at
this time.  Solution is simple.  This patch commonize some logic for
newly allocated page and uses it on all sites.  This will solve the
problem.

[1] http://marc.info/?i=alpine.LSU.2.11.1604270029350.7066%40eggly.anvils%3E

[iamjoonsoo.kim@lge.com: mm-page_alloc-introduce-post-allocation-processing-on-page-allocator-v3]
  Link: http://lkml.kernel.org/r/1464230275-25791-7-git-send-email-iamjoonsoo.kim@lge.com
  Link: http://lkml.kernel.org/r/1466150259-27727-9-git-send-email-iamjoonsoo.kim@lge.com
Link: http://lkml.kernel.org/r/1464230275-25791-7-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Mel Gorman
e838a45f93 mm, sl[au]b: add __GFP_ATOMIC to the GFP reclaim mask
Commit d0164adc89f6 ("mm, page_alloc: distinguish between being unable
to sleep, unwilling to sleep and avoiding waking kswapd") modified
__GFP_WAIT to explicitly identify the difference between atomic callers
and those that were unwilling to sleep.  Later the definition was
removed entirely.

The GFP_RECLAIM_MASK is the set of flags that affect watermark checking
and reclaim behaviour but __GFP_ATOMIC was never added.  Without it,
atomic users of the slab allocator strip the __GFP_ATOMIC flag and
cannot access the page allocator atomic reserves.  This patch addresses
the problem.

The user-visible impact depends on the workload but potentially atomic
allocations unnecessarily fail without this path.

Link: http://lkml.kernel.org/r/20160610093832.GK2527@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Marcin Wojtas <mw@semihalf.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>	[4.4+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24 17:23:52 -07:00
Michal Hocko
9fbeb5ab59 mm: make vm_mmap killable
All the callers of vm_mmap seem to check for the failure already and
bail out in one way or another on the error which means that we can
change it to use killable version of vm_mmap_pgoff and return -EINTR if
the current task gets killed while waiting for mmap_sem.  This also
means that vm_mmap_pgoff can be killable by default and drop the
additional parameter.

This will help in the OOM conditions when the oom victim might be stuck
waiting for the mmap_sem for write which in turn can block oom_reaper
which relies on the mmap_sem for read to make a forward progress and
reclaim the address space of the victim.

Please note that load_elf_binary is ignoring vm_mmap error for
current->personality & MMAP_PAGE_ZERO case but that shouldn't be a
problem because the address is not used anywhere and we never return to
the userspace if we got killed.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 17:04:14 -07:00
Michal Hocko
dc0ef0df7b mm: make mmap_sem for write waits killable for mm syscalls
This is a follow up work for oom_reaper [1].  As the async OOM killing
depends on oom_sem for read we would really appreciate if a holder for
write didn't stood in the way.  This patchset is changing many of
down_write calls to be killable to help those cases when the writer is
blocked and waiting for readers to release the lock and so help
__oom_reap_task to process the oom victim.

Most of the patches are really trivial because the lock is help from a
shallow syscall paths where we can return EINTR trivially and allow the
current task to die (note that EINTR will never get to the userspace as
the task has fatal signal pending).  Others seem to be easy as well as
the callers are already handling fatal errors and bail and return to
userspace which should be sufficient to handle the failure gracefully.
I am not familiar with all those code paths so a deeper review is really
appreciated.

As this work is touching more areas which are not directly connected I
have tried to keep the CC list as small as possible and people who I
believed would be familiar are CCed only to the specific patches (all
should have received the cover though).

This patchset is based on linux-next and it depends on
down_write_killable for rw_semaphores which got merged into tip
locking/rwsem branch and it is merged into this next tree.  I guess it
would be easiest to route these patches via mmotm because of the
dependency on the tip tree but if respective maintainers prefer other
way I have no objections.

I haven't covered all the mmap_write(mm->mmap_sem) instances here

  $ git grep "down_write(.*\<mmap_sem\>)" next/master | wc -l
  98
  $ git grep "down_write(.*\<mmap_sem\>)" | wc -l
  62

I have tried to cover those which should be relatively easy to review in
this series because this alone should be a nice improvement.  Other
places can be changed on top.

[0] http://lkml.kernel.org/r/1456752417-9626-1-git-send-email-mhocko@kernel.org
[1] http://lkml.kernel.org/r/1452094975-551-1-git-send-email-mhocko@kernel.org
[2] http://lkml.kernel.org/r/1456750705-7141-1-git-send-email-mhocko@kernel.org

This patch (of 18):

This is the first step in making mmap_sem write waiters killable.  It
focuses on the trivial ones which are taking the lock early after
entering the syscall and they are not changing state before.

Therefore it is very easy to change them to use down_write_killable and
immediately return with -EINTR.  This will allow the waiter to pass away
without blocking the mmap_sem which might be required to make a forward
progress.  E.g.  the oom reaper will need the lock for reading to
dismantle the OOM victim address space.

The only tricky function in this patch is vm_mmap_pgoff which has many
call sites via vm_mmap.  To reduce the risk keep vm_mmap with the
original non-killable semantic for now.

vm_munmap callers do not bother checking the return value so open code
it into the munmap syscall path for now for simplicity.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 17:04:14 -07:00
Michal Hocko
c8f7de0bfa mm, compaction: distinguish between full and partial COMPACT_COMPLETE
COMPACT_COMPLETE now means that compaction and free scanner met.  This
is not very useful information if somebody just wants to use this
feedback and make any decisions based on that.  The current caller might
be a poor guy who just happened to scan tiny portion of the zone and
that could be the reason no suitable pages were compacted.  Make sure we
distinguish the full and partial zone walks.

Consumers should treat COMPACT_PARTIAL_SKIPPED as a potential success
and be optimistic in retrying.

The existing users of COMPACT_COMPLETE are conservatively changed to use
COMPACT_PARTIAL_SKIPPED as well but some of them should be probably
reconsidered and only defer the compaction only for COMPACT_COMPLETE
with the new semantic.

This patch shouldn't introduce any functional changes.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Mel Gorman
93ea9964d1 mm, page_alloc: remove field from alloc_context
The classzone_idx can be inferred from preferred_zoneref so remove the
unnecessary field and save stack space.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Mel Gorman
c33d6c06f6 mm, page_alloc: avoid looking up the first zone in a zonelist twice
The allocator fast path looks up the first usable zone in a zonelist and
then get_page_from_freelist does the same job in the zonelist iterator.
This patch preserves the necessary information.

                                             4.6.0-rc2                  4.6.0-rc2
                                        fastmark-v1r20             initonce-v1r20
  Min      alloc-odr0-1               364.00 (  0.00%)           359.00 (  1.37%)
  Min      alloc-odr0-2               262.00 (  0.00%)           260.00 (  0.76%)
  Min      alloc-odr0-4               214.00 (  0.00%)           214.00 (  0.00%)
  Min      alloc-odr0-8               186.00 (  0.00%)           186.00 (  0.00%)
  Min      alloc-odr0-16              173.00 (  0.00%)           173.00 (  0.00%)
  Min      alloc-odr0-32              165.00 (  0.00%)           165.00 (  0.00%)
  Min      alloc-odr0-64              161.00 (  0.00%)           162.00 ( -0.62%)
  Min      alloc-odr0-128             159.00 (  0.00%)           161.00 ( -1.26%)
  Min      alloc-odr0-256             168.00 (  0.00%)           170.00 ( -1.19%)
  Min      alloc-odr0-512             180.00 (  0.00%)           181.00 ( -0.56%)
  Min      alloc-odr0-1024            190.00 (  0.00%)           190.00 (  0.00%)
  Min      alloc-odr0-2048            196.00 (  0.00%)           196.00 (  0.00%)
  Min      alloc-odr0-4096            202.00 (  0.00%)           202.00 (  0.00%)
  Min      alloc-odr0-8192            206.00 (  0.00%)           205.00 (  0.49%)
  Min      alloc-odr0-16384           206.00 (  0.00%)           205.00 (  0.49%)

The benefit is negligible and the results are within the noise but each
cycle counts.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Mel Gorman
c603844bdc mm, page_alloc: convert alloc_flags to unsigned
alloc_flags is a bitmask of flags but it is signed which does not
necessarily generate the best code depending on the compiler.  Even
without an impact, it makes more sense that this be unsigned.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Joonsoo Kim
0139aa7b7f mm: rename _count, field of the struct page, to _refcount
Many developers already know that field for reference count of the
struct page is _count and atomic type.  They would try to handle it
directly and this could break the purpose of page reference count
tracepoint.  To prevent direct _count modification, this patch rename it
to _refcount and add warning message on the code.  After that, developer
who need to handle reference count will find that field should not be
accessed directly.

[akpm@linux-foundation.org: fix comments, per Vlastimil]
[akpm@linux-foundation.org: Documentation/vm/transhuge.txt too]
[sfr@canb.auug.org.au: sync ethernet driver changes]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Sunil Goutham <sgoutham@cavium.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Manish Chopra <manish.chopra@qlogic.com>
Cc: Yuval Mintz <yuval.mintz@qlogic.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Michal Hocko
aac4536355 mm, oom: introduce oom reaper
This patch (of 5):

This is based on the idea from Mel Gorman discussed during LSFMM 2015
and independently brought up by Oleg Nesterov.

The OOM killer currently allows to kill only a single task in a good
hope that the task will terminate in a reasonable time and frees up its
memory.  Such a task (oom victim) will get an access to memory reserves
via mark_oom_victim to allow a forward progress should there be a need
for additional memory during exit path.

It has been shown (e.g.  by Tetsuo Handa) that it is not that hard to
construct workloads which break the core assumption mentioned above and
the OOM victim might take unbounded amount of time to exit because it
might be blocked in the uninterruptible state waiting for an event (e.g.
lock) which is blocked by another task looping in the page allocator.

This patch reduces the probability of such a lockup by introducing a
specialized kernel thread (oom_reaper) which tries to reclaim additional
memory by preemptively reaping the anonymous or swapped out memory owned
by the oom victim under an assumption that such a memory won't be needed
when its owner is killed and kicked from the userspace anyway.  There is
one notable exception to this, though, if the OOM victim was in the
process of coredumping the result would be incomplete.  This is
considered a reasonable constrain because the overall system health is
more important than debugability of a particular application.

A kernel thread has been chosen because we need a reliable way of
invocation so workqueue context is not appropriate because all the
workers might be busy (e.g.  allocating memory).  Kswapd which sounds
like another good fit is not appropriate as well because it might get
blocked on locks during reclaim as well.

oom_reaper has to take mmap_sem on the target task for reading so the
solution is not 100% because the semaphore might be held or blocked for
write but the probability is reduced considerably wrt.  basically any
lock blocking forward progress as described above.  In order to prevent
from blocking on the lock without any forward progress we are using only
a trylock and retry 10 times with a short sleep in between.  Users of
mmap_sem which need it for write should be carefully reviewed to use
_killable waiting as much as possible and reduce allocations requests
done with the lock held to absolute minimum to reduce the risk even
further.

The API between oom killer and oom reaper is quite trivial.
wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only
NULL->mm transition and oom_reaper clear this atomically once it is done
with the work.  This means that only a single mm_struct can be reaped at
the time.  As the operation is potentially disruptive we are trying to
limit it to the ncessary minimum and the reaper blocks any updates while
it operates on an mm.  mm_struct is pinned by mm_count to allow parallel
exit_mmap and a race is detected by atomic_inc_not_zero(mm_users).

Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Mel Gorman <mgorman@suse.de>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Argangeli <andrea@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25 16:37:42 -07:00
Joe Perches
1170532bb4 mm: convert printk(KERN_<LEVEL> to pr_<level>
Most of the mm subsystem uses pr_<level> so make it consistent.

Miscellanea:

 - Realign arguments
 - Add missing newline to format
 - kmemleak-test.c has a "kmemleak: " prefix added to the
   "Kmemleak testing" logging message via pr_fmt

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Tejun Heo <tj@kernel.org>	[percpu]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Joonsoo Kim
fe896d1878 mm: introduce page reference manipulation functions
The success of CMA allocation largely depends on the success of
migration and key factor of it is page reference count.  Until now, page
reference is manipulated by direct calling atomic functions so we cannot
follow up who and where manipulate it.  Then, it is hard to find actual
reason of CMA allocation failure.  CMA allocation should be guaranteed
to succeed so finding offending place is really important.

In this patch, call sites where page reference is manipulated are
converted to introduced wrapper function.  This is preparation step to
add tracepoint to each page reference manipulation function.  With this
facility, we can easily find reason of CMA allocation failure.  There is
no functional change in this patch.

In addition, this patch also converts reference read sites.  It will
help a second step that renames page._count to something else and
prevents later attempt to direct access to it (Suggested by Andrew).

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Vlastimil Babka
accf62422b mm, kswapd: replace kswapd compaction with waking up kcompactd
Similarly to direct reclaim/compaction, kswapd attempts to combine
reclaim and compaction to attempt making memory allocation of given
order available.

The details differ from direct reclaim e.g. in having high watermark as
a goal.  The code involved in kswapd's reclaim/compaction decisions has
evolved to be quite complex.

Testing reveals that it doesn't actually work in at least one scenario,
and closer inspection suggests that it could be greatly simplified
without compromising on the goal (make high-order page available) or
efficiency (don't reclaim too much).  The simplification relieas of
doing all compaction in kcompactd, which is simply woken up when high
watermarks are reached by kswapd's reclaim.

The scenario where kswapd compaction doesn't work was found with mmtests
test stress-highalloc configured to attempt order-9 allocations without
direct reclaim, just waking up kswapd.  There was no compaction attempt
from kswapd during the whole test.  Some added instrumentation shows
what happens:

 - balance_pgdat() sets end_zone to Normal, as it's not balanced
 - reclaim is attempted on DMA zone, which sets nr_attempted to 99, but
   it cannot reclaim anything, so sc.nr_reclaimed is 0
 - for zones DMA32 and Normal, kswapd_shrink_zone uses testorder=0, so
   it merely checks if high watermarks were reached for base pages.
   This is true, so no reclaim is attempted.  For DMA, testorder=0
   wasn't used, as compaction_suitable() returned COMPACT_SKIPPED
 - even though the pgdat_needs_compaction flag wasn't set to false, no
   compaction happens due to the condition sc.nr_reclaimed >
   nr_attempted being false (as 0 < 99)
 - priority-- due to nr_reclaimed being 0, repeat until priority reaches
   0 pgdat_balanced() is false as only the small zone DMA appears
   balanced (curiously in that check, watermark appears OK and
   compaction_suitable() returns COMPACT_PARTIAL, because a lower
   classzone_idx is used there)

Now, even if it was decided that reclaim shouldn't be attempted on the
DMA zone, the scenario would be the same, as (sc.nr_reclaimed=0 >
nr_attempted=0) is also false.  The condition really should use >= as
the comment suggests.  Then there is a mismatch in the check for setting
pgdat_needs_compaction to false using low watermark, while the rest uses
high watermark, and who knows what other subtlety.  Hopefully this
demonstrates that this is unsustainable.

Luckily we can simplify this a lot.  The reclaim/compaction decisions
make sense for direct reclaim scenario, but in kswapd, our primary goal
is to reach high watermark in order-0 pages.  Afterwards we can attempt
compaction just once.  Unlike direct reclaim, we don't reclaim extra
pages (over the high watermark), the current code already disallows it
for good reasons.

After this patch, we simply wake up kcompactd to process the pgdat,
after we have either succeeded or failed to reach the high watermarks in
kswapd, which goes to sleep.  We pass kswapd's order and classzone_idx,
so kcompactd can apply the same criteria to determine which zones are
worth compacting.  Note that we use the classzone_idx from
wakeup_kswapd(), not balanced_classzone_idx which can include higher
zones that kswapd tried to balance too, but didn't consider them in
pgdat_balanced().

Since kswapd now cannot create high-order pages itself, we need to
adjust how it determines the zones to be balanced.  The key element here
is adding a "highorder" parameter to zone_balanced, which, when set to
false, makes it consider only order-0 watermark instead of the desired
higher order (this was done previously by kswapd_shrink_zone(), but not
elsewhere).  This false is passed for example in pgdat_balanced().
Importantly, wakeup_kswapd() uses true to make sure kswapd and thus
kcompactd are woken up for a high-order allocation failure.

The last thing is to decide what to do with pageblock_skip bitmap
handling.  Compaction maintains a pageblock_skip bitmap to record
pageblocks where isolation recently failed.  This bitmap can be reset by
three ways:

1) direct compaction is restarting after going through the full deferred cycle

2) kswapd goes to sleep, and some other direct compaction has previously
   finished scanning the whole zone and set zone->compact_blockskip_flush.
   Note that a successful direct compaction clears this flag.

3) compaction was invoked manually via trigger in /proc

The case 2) is somewhat fuzzy to begin with, but after introducing
kcompactd we should update it.  The check for direct compaction in 1),
and to set the flush flag in 2) use current_is_kswapd(), which doesn't
work for kcompactd.  Thus, this patch adds bool direct_compaction to
compact_control to use in 2).  For the case 1) we remove the check
completely - unlike the former kswapd compaction, kcompactd does use the
deferred compaction functionality, so flushing tied to restarting from
deferred compaction makes sense here.

Note that when kswapd goes to sleep, kcompactd is woken up, so it will
see the flushed pageblock_skip bits.  This is different from when the
former kswapd compaction observed the bits and I believe it makes more
sense.  Kcompactd can afford to be more thorough than a direct
compaction trying to limit allocation latency, or kswapd whose primary
goal is to reclaim.

For testing, I used stress-highalloc configured to do order-9
allocations with GFP_NOWAIT|__GFP_HIGH|__GFP_COMP, so they relied just
on kswapd/kcompactd reclaim/compaction (the interfering kernel builds in
phases 1 and 2 work as usual):

stress-highalloc
                        4.5-rc1+before          4.5-rc1+after
                             -nodirect              -nodirect
Success 1 Min          1.00 (  0.00%)         5.00 (-66.67%)
Success 1 Mean         1.40 (  0.00%)         6.20 (-55.00%)
Success 1 Max          2.00 (  0.00%)         7.00 (-16.67%)
Success 2 Min          1.00 (  0.00%)         5.00 (-66.67%)
Success 2 Mean         1.80 (  0.00%)         6.40 (-52.38%)
Success 2 Max          3.00 (  0.00%)         7.00 (-16.67%)
Success 3 Min         34.00 (  0.00%)        62.00 (  1.59%)
Success 3 Mean        41.80 (  0.00%)        63.80 (  1.24%)
Success 3 Max         53.00 (  0.00%)        65.00 (  2.99%)

User                          3166.67        3181.09
System                        1153.37        1158.25
Elapsed                       1768.53        1799.37

                            4.5-rc1+before   4.5-rc1+after
                                 -nodirect    -nodirect
Direct pages scanned                32938        32797
Kswapd pages scanned              2183166      2202613
Kswapd pages reclaimed            2152359      2143524
Direct pages reclaimed              32735        32545
Percentage direct scans                1%           1%
THP fault alloc                       579          612
THP collapse alloc                    304          316
THP splits                              0            0
THP fault fallback                    793          778
THP collapse fail                      11           16
Compaction stalls                    1013         1007
Compaction success                     92           67
Compaction failures                   920          939
Page migrate success               238457       721374
Page migrate failure                23021        23469
Compaction pages isolated          504695      1479924
Compaction migrate scanned         661390      8812554
Compaction free scanned          13476658     84327916
Compaction cost                       262          838

After this patch we see improvements in allocation success rate
(especially for phase 3) along with increased compaction activity.  The
compaction stalls (direct compaction) in the interfering kernel builds
(probably THP's) also decreased somewhat thanks to kcompactd activity,
yet THP alloc successes improved a bit.

Note that elapsed and user time isn't so useful for this benchmark,
because of the background interference being unpredictable.  It's just
to quickly spot some major unexpected differences.  System time is
somewhat more useful and that didn't increase.

Also (after adjusting mmtests' ftrace monitor):

Time kswapd awake               2547781     2269241
Time kcompactd awake                  0      119253
Time direct compacting           939937      557649
Time kswapd compacting                0           0
Time kcompactd compacting             0      119099

The decrease of overal time spent compacting appears to not match the
increased compaction stats.  I suspect the tasks get rescheduled and
since the ftrace monitor doesn't see that, the reported time is wall
time, not CPU time.  But arguably direct compactors care about overall
latency anyway, whether busy compacting or waiting for CPU doesn't
matter.  And that latency seems to almost halved.

It's also interesting how much time kswapd spent awake just going
through all the priorities and failing to even try compacting, over and
over.

We can also configure stress-highalloc to perform both direct
reclaim/compaction and wakeup kswapd/kcompactd, by using
GFP_KERNEL|__GFP_HIGH|__GFP_COMP:

stress-highalloc
                        4.5-rc1+before         4.5-rc1+after
                               -direct               -direct
Success 1 Min          4.00 (  0.00%)        9.00 (-50.00%)
Success 1 Mean         8.00 (  0.00%)       10.00 (-19.05%)
Success 1 Max         12.00 (  0.00%)       11.00 ( 15.38%)
Success 2 Min          4.00 (  0.00%)        9.00 (-50.00%)
Success 2 Mean         8.20 (  0.00%)       10.00 (-16.28%)
Success 2 Max         13.00 (  0.00%)       11.00 (  8.33%)
Success 3 Min         75.00 (  0.00%)       74.00 (  1.33%)
Success 3 Mean        75.60 (  0.00%)       75.20 (  0.53%)
Success 3 Max         77.00 (  0.00%)       76.00 (  0.00%)

User                          3344.73       3246.04
System                        1194.24       1172.29
Elapsed                       1838.04       1836.76

                            4.5-rc1+before  4.5-rc1+after
                                   -direct     -direct
Direct pages scanned               125146      120966
Kswapd pages scanned              2119757     2135012
Kswapd pages reclaimed            2073183     2108388
Direct pages reclaimed             124909      120577
Percentage direct scans                5%          5%
THP fault alloc                       599         652
THP collapse alloc                    323         354
THP splits                              0           0
THP fault fallback                    806         793
THP collapse fail                      17          16
Compaction stalls                    2457        2025
Compaction success                    906         518
Compaction failures                  1551        1507
Page migrate success              2031423     2360608
Page migrate failure                32845       40852
Compaction pages isolated         4129761     4802025
Compaction migrate scanned       11996712    21750613
Compaction free scanned         214970969   344372001
Compaction cost                      2271        2694

In this scenario, this patch doesn't change the overall success rate as
direct compaction already tries all it can.  There's however significant
reduction in direct compaction stalls (that is, the number of
allocations that went into direct compaction).  The number of successes
(i.e.  direct compaction stalls that ended up with successful
allocation) is reduced by the same number.  This means the offload to
kcompactd is working as expected, and direct compaction is reduced
either due to detecting contention, or compaction deferred by kcompactd.
In the previous version of this patchset there was some apparent
reduction of success rate, but the changes in this version (such as
using sync compaction only), new baseline kernel, and/or averaging
results from 5 executions (my bet), made this go away.

Ftrace-based stats seem to roughly agree:

Time kswapd awake               2532984     2326824
Time kcompactd awake                  0      257916
Time direct compacting           864839      735130
Time kswapd compacting                0           0
Time kcompactd compacting             0      257585

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00