mirror of
https://github.com/rd-stuffs/msm-4.14.git
synced 2025-02-20 11:45:48 +08:00
13459 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
|
f59653f6d3
|
percpu: use chunk scan_hint to skip some scanning
Just like blocks, chunks now maintain a scan_hint. This can be used to skip some scanning by promoting the scan_hint to be the contig_hint. The chunk's scan_hint is primarily updated on the backside and relies on full scanning when a block becomes free or the free region spans across blocks. Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: Peng Fan <peng.fan@nxp.com> Signed-off-by: azrim <mirzaspc@gmail.com> |
||
|
e28f68efe5
|
percpu: convert chunk hints to be based on pcpu_block_md
As mentioned in the last patch, a chunk's hints are no different than a block just responsible for more bits. This converts chunk level hints to use a pcpu_block_md to maintain them. This lets us reuse the same hint helper functions as a block. The left_free and right_free are unused by the chunk's pcpu_block_md. Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: Peng Fan <peng.fan@nxp.com> Signed-off-by: azrim <mirzaspc@gmail.com> |
||
|
9c466f2de6
|
percpu: make pcpu_block_md generic
In reality, a chunk is just a block covering a larger number of bits. The hints themselves are one and the same. Rather than maintaining the hints separately, first introduce nr_bits to genericize pcpu_block_update() to correctly maintain block->right_free. The next patch will convert chunk hints to be managed as a pcpu_block_md. Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: Peng Fan <peng.fan@nxp.com> Signed-off-by: azrim <mirzaspc@gmail.com> |
||
|
793fac2a5f
|
percpu: use block scan_hint to only scan forward
Blocks now remember the latest scan_hint. This can be used on the allocation path as when a contig_hint is broken, we can promote the scan_hint to the contig_hint and scan forward from there. This works because pcpu_block_refresh_hint() is only called on the allocation path while block free regions are updated manually in pcpu_block_update_hint_free(). Signed-off-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: azrim <mirzaspc@gmail.com> |
||
|
f381bb8342
|
percpu: remember largest area skipped during allocation
Percpu allocations attempt to do first fit by scanning forward from the first_free of a block. However, fragmentation from allocation requests can cause holes not seen by block hint update functions. To address this, create a local version of bitmap_find_next_zero_area_off() that remembers the largest area skipped over. The caveat is that it only sees regions skipped over due to not fitting, not regions skipped due to alignment. Prior to updating the scan_hint, a scan backwards is done to try and recover free bits skipped due to alignment. While this can cause scanning to miss earlier possible free areas, smaller allocations will eventually fill those holes due to first fit. Signed-off-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: azrim <mirzaspc@gmail.com> |
||
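The scan described in "percpu: remember largest area skipped during allocation" can be pictured with a small userspace model. This is only a sketch under stated assumptions, not the kernel's local version of bitmap_find_next_zero_area_off(): the bitmap is a Python list of 0/1 values (1 = allocated), alignment is ignored entirely, and the name `find_fit_remember_skipped` is hypothetical.

```python
def find_fit_remember_skipped(bitmap, start, size):
    """First-fit scan that remembers the largest free run it had to skip.

    bitmap: list of 0/1 ints, 1 meaning the bit is allocated (assumption).
    Returns (offset, largest_skipped). offset is the start of the first free
    run of at least `size` bits at or after `start`, or None if nothing fits.
    largest_skipped is (run_start, run_length) of the biggest free run that
    was passed over because it was too small; (None, 0) if none was skipped.
    """
    largest = (None, 0)
    i = start
    n = len(bitmap)
    while i < n:
        if bitmap[i]:
            i += 1
            continue
        # Measure the free run beginning at i.
        run_start = i
        while i < n and not bitmap[i]:
            i += 1
        run_len = i - run_start
        if run_len >= size:
            return run_start, largest
        if run_len > largest[1]:
            largest = (run_start, run_len)
    return None, largest
```

The remembered run plays the role the message assigns to the scan_hint: a later, smaller request can start from it instead of rescanning from first_free.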
|
de6f43fc0c
|
percpu: add block level scan_hint
Fragmentation can cause both blocks and chunks to have an early first_free bit available, but only be able to satisfy allocations much later on. This patch introduces a scan_hint to help mitigate some unnecessary scanning. The scan_hint remembers the largest area prior to the contig_hint. If the contig_hint == scan_hint, then scan_hint_start > contig_hint_start. This is necessary for scan_hint discovery when refreshing a block. Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: Peng Fan <peng.fan@nxp.com> Signed-off-by: azrim <mirzaspc@gmail.com> |
||
|
b77aeaecbc
|
percpu: set PCPU_BITMAP_BLOCK_SIZE to PAGE_SIZE
Previously, block size was flexible based on the constraint that the GCD(PCPU_BITMAP_BLOCK_SIZE, PAGE_SIZE) > 1. However, this carried the overhead that keeping a floating number of populated free pages required scanning over the free regions of a chunk. Setting the block size to be fixed at PAGE_SIZE lets us know when an empty page becomes used as we will break a full contig_hint of a block. This means we no longer have to scan the whole chunk upon breaking a contig_hint which empty page management piggybacked off. A later patch takes advantage of this to optimize the allocation path by only scanning forward using the scan_hint introduced later too. Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: Peng Fan <peng.fan@nxp.com> Signed-off-by: azrim <mirzaspc@gmail.com> |
||
|
0f08fa64d6
|
percpu: relegate chunks unusable when failing small allocations
In certain cases, requestors of percpu memory may want specific alignments. However, it is possible to end up in situations where the contig_hint matches, but the alignment does not. This causes excess scanning of chunks that will fail. To prevent this, if a small allocation fails (< 32B), the chunk is moved to the empty list. Once an allocation is freed from that chunk, it is placed back into rotation. Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: Peng Fan <peng.fan@nxp.com> Signed-off-by: azrim <mirzaspc@gmail.com> |
||
|
ac57e64155
|
percpu: manage chunks based on contig_bits instead of free_bytes
When a chunk becomes fragmented, it can end up having a large number of small allocation areas free. The free_bytes sorting of chunks leads to unnecessary checking of chunks that cannot satisfy the allocation. Switch to contig_bits sorting to prevent scanning chunks that may not be able to service the allocation request. Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: Peng Fan <peng.fan@nxp.com> Signed-off-by: azrim <mirzaspc@gmail.com> |
||
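The motivation for sorting chunks by contig_bits rather than free_bytes can be shown with a toy comparison. This is a minimal sketch, assuming a chunk is modelled as a list of 0/1 bits (1 = allocated); the helper names `free_bits` and `contig_bits` and the sample chunks are hypothetical, and alignment is ignored.

```python
def free_bits(bitmap):
    """Total number of free bits, regardless of where they are."""
    return bitmap.count(0)


def contig_bits(bitmap):
    """Length of the longest run of consecutive free bits."""
    best = run = 0
    for bit in bitmap:
        run = 0 if bit else run + 1
        best = max(best, run)
    return best


# A heavily fragmented chunk: 8 free bits, but never two in a row.
fragmented = [1, 0] * 8
# A mostly full chunk whose free space is one contiguous region.
compact = [1] * 12 + [0] * 4

request = 4  # bits needed by the allocation

# Sorting by total free space makes `fragmented` look like the better
# candidate, yet only `compact` can actually service the request.
looks_ok_by_free = free_bits(fragmented) >= request      # True
really_fits_fragmented = contig_bits(fragmented) >= request  # False
really_fits_compact = contig_bits(compact) >= request        # True
```

Filtering on the longest contiguous run is what lets the allocator skip chunks that would be scanned only to fail.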
|
c0ce971830
|
percpu: introduce helper to determine if two regions overlap
While block hints were always accurate, it's possible when spanning across blocks that we miss updating the chunk's contig_hint. Rather than rely on correctness of the boundaries of hints, do a full overlap comparison. A future patch introduces the scan_hint which makes the contig_hint slightly fuzzy as they can at times be smaller than the actual hint. Signed-off-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: azrim <mirzaspc@gmail.com> |
||
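A full overlap comparison of the kind this message describes is short enough to sketch. The version below assumes both regions are half-open intervals [a, b) and [x, y) measured in bits; the name `regions_overlap` is hypothetical and the code is an illustration rather than the kernel helper itself.

```python
def regions_overlap(a, b, x, y):
    """Return True if half-open regions [a, b) and [x, y) share any bit.

    Two regions overlap exactly when each one starts before the other ends,
    so no assumption about which region comes first is needed.
    """
    return a < y and x < b
```

Because the test is symmetric, it stays correct even when a hint's boundaries are stale or smaller than the real free area, which is the situation the message anticipates once scan_hint is introduced.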
|
a2b7117deb
|
percpu: update free path with correct new free region
When updating the chunk's contig_hint on the free path of a hint that does not touch the page boundaries, it was incorrectly using the starting offset of the free region and the block's contig_hint. This could lead to incorrect assumptions about fit given a size and better alignment of the start. Fix this by using (end - start) as this is only called when updating a hint within a block. Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: Peng Fan <peng.fan@nxp.com> Signed-off-by: azrim <mirzaspc@gmail.com> |
||
|
9150993206
|
mm/page_alloc: simplify page_is_buddy() for better code readability
Simplify page_is_buddy() to reduce the redundant code for better code readability. Signed-off-by: chenqiwu <chenqiwu@xiaomi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Link: http://lkml.kernel.org/r/1583853751-5525-1-git-send-email-qiwuchen55@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: azrim <mirzaspc@gmail.com> |
||
|
fe53aea1d0
|
mm/vmalloc.c: remove unnecessary highmem_mask from parameter of gfpflags_allow_blocking()
gfpflags_allow_blocking() does not care about __GFP_HIGHMEM, so highmem_mask can be removed. Link: http://lkml.kernel.org/r/1568812319-3467-1-git-send-email-liuxiang_1999@126.com Signed-off-by: Liu Xiang <liuxiang_1999@126.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: azrim <mirzaspc@gmail.com> |
||
|
9fdece4b65
|
mm, page_alloc: spread allocations across zones before introducing fragmentation
Patch series "Fragmentation avoidance improvements", v5. It has been noted before that fragmentation avoidance (aka anti-fragmentation) is not perfect. Given sufficient time or an adverse workload, memory gets fragmented and the long-term success of high-order allocations degrades. This series defines an adverse workload, a definition of external fragmentation events (including serious ones) and a series that reduces the level of those fragmentation events. The details of the workload and the consequences are described in more detail in the changelogs. However, from patch 1, this is a high-level summary of the adverse workload. The exact details are found in the mmtests implementation. The broad details of the workload are as follows: 1. Create an XFS filesystem (not specified in the configuration but done as part of the testing for this patch). 2. Start 4 fio threads that write a number of 64K files inefficiently. Inefficiently means that files are created on first access and not created in advance (fio parameter create_on_open=1) and fallocate is not used (fallocate=none). With multiple IO issuers this creates a mix of slab and page cache allocations over time. The total size of the files is 150% physical memory so that the slabs and page cache pages get mixed. 3. Warm up a number of fio read-only threads accessing the same files created in step 2. This part runs for the same length of time it took to create the files. It'll fault back in old data and further interleave slab and page cache allocations. As it's now low on memory due to step 2, fragmentation occurs as pageblocks get stolen. 4. While step 3 is still running, start a process that tries to allocate 75% of memory as huge pages with a number of threads. The number of threads is based on a (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP threads contending with fio, any other threads or forcing cross-NUMA scheduling. Note that the test has not been used on a machine with less than 8 cores. 
The benchmark records whether huge pages were allocated and what the fault latency was in microseconds 5. Measure the number of events potentially causing external fragmentation, the fault latency and the huge page allocation success rate. 6. Cleanup Overall the series reduces external fragmentation causing events by over 94% on 1 and 2 socket machines, which in turn impacts high-order allocation success rates over the long term. There are differences in latencies and high-order allocation success rates. Latencies are a mixed bag as they are vulnerable to exact system state and whether allocations succeeded so they are treated as a secondary metric. Patch 1 uses lower zones if they are populated and have free memory instead of fragmenting a higher zone. It's special cased to handle a Normal->DMA32 fallback with the reasons explained in the changelog. Patch 2-4 boosts watermarks temporarily when an external fragmentation event occurs. kswapd wakes to reclaim a small amount of old memory and then wakes kcompactd on completion to recover the system slightly. This introduces some overhead in the slowpath. The level of boosting can be tuned or disabled depending on the tolerance for fragmentation vs allocation latency. Patch 5 stalls some movable allocation requests to let kswapd from patch 4 make some progress. The duration of the stalls is very low but it is possible to tune the system to avoid fragmentation events if larger stalls can be tolerated. The bulk of the improvement in fragmentation avoidance is from patches 1-4 but patch 5 can deal with a rare corner case and provides the option of tuning a system for THP allocation success rates in exchange for some stalls to control fragmentation. This patch (of 5): The page allocator zone lists are iterated based on the watermarks of each zone which does not take anti-fragmentation into account. On x86, node 0 may have multiple zones while other nodes have one zone. 
A consequence is that tasks running on node 0 may fragment ZONE_NORMAL even though ZONE_DMA32 has plenty of free memory. This patch special cases the allocator fast path such that it'll try an allocation from a lower local zone before fragmenting a higher zone. In this case, stealing of pageblocks or orders larger than a pageblock are still allowed in the fast path as they are uninteresting from a fragmentation point of view. This was evaluated using a benchmark designed to fragment memory before attempting THP allocations. It's implemented in mmtests as the following configurations configs/config-global-dhp__workload_thpfioscale configs/config-global-dhp__workload_thpfioscale-defrag configs/config-global-dhp__workload_thpfioscale-madvhugepage e.g. from mmtests ./run-mmtests.sh --run-monitor --config configs/config-global-dhp__workload_thpfioscale test-run-1 The broad details of the workload are as follows; 1. Create an XFS filesystem (not specified in the configuration but done as part of the testing for this patch). 2. Start 4 fio threads that write a number of 64K files inefficiently. Inefficiently means that files are created on first access and not created in advance (fio parameter create_on_open=1) and fallocate is not used (fallocate=none). With multiple IO issuers this creates a mix of slab and page cache allocations over time. The total size of the files is 150% physical memory so that the slabs and page cache pages get mixed. 3. Warm up a number of fio read-only processes accessing the same files created in step 2. This part runs for the same length of time it took to create the files. It'll refault old data and further interleave slab and page cache allocations. As it's now low on memory due to step 2, fragmentation occurs as pageblocks get stolen. 4. While step 3 is still running, start a process that tries to allocate 75% of memory as huge pages with a number of threads. 
The number of threads is based on a (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP threads contending with fio, any other threads or forcing cross-NUMA scheduling. Note that the test has not been used on a machine with less than 8 cores. The benchmark records whether huge pages were allocated and what the fault latency was in microseconds. 5. Measure the number of events potentially causing external fragmentation, the fault latency and the huge page allocation success rate. 6. Cleanup the test files. Note that due to the use of IO and page cache that this benchmark is not suitable for running on large machines where the time to fragment memory may be excessive. Also note that while this is one mix that generates fragmentation that it's not the only mix that generates fragmentation. Differences in workload that are more slab-intensive or whether SLUB is used with high-order pages may yield different results. When the page allocator fragments memory, it records the event using the mm_page_alloc_extfrag ftrace event. If the fallback_order is smaller than a pageblock order (order-9 on 64-bit x86) then it's considered to be an "external fragmentation event" that may cause issues in the future. Hence, the primary metric here is the number of external fragmentation events that occur with order < 9. The secondary metric is allocation latency and huge page allocation success rates but note that differences in latencies and what the success rate also can affect the number of external fragmentation event which is why it's a secondary metric. 
1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------

4.20-rc3 extfrag events < order 9:   804694
4.20-rc3+patch:                      408912 (49% reduction)

thpfioscale Fault Latencies
                           4.20.0-rc3            4.20.0-rc3
                              vanilla          lowzone-v5r8
Amean fault-base-1    662.92 (  0.00%)     653.58 *  1.41%*
Amean fault-huge-1      0.00 (  0.00%)       0.00 (  0.00%)

                           4.20.0-rc3            4.20.0-rc3
                              vanilla          lowzone-v5r8
Percentage huge-1       0.00 (  0.00%)       0.00 (  0.00%)

Fault latencies are slightly reduced while allocation success rates remain at zero as this configuration does not make any special effort to allocate THP and fio is heavily active at the time and either filling memory or keeping pages resident. However, a 49% reduction of serious fragmentation events reduces the chances of external fragmentation being a problem in the future.

Vlastimil asked during review for a breakdown of the allocation types that are falling back.

vanilla
  3816 MIGRATE_UNMOVABLE
800845 MIGRATE_MOVABLE
    33 MIGRATE_UNRECLAIMABLE

patch
   735 MIGRATE_UNMOVABLE
408135 MIGRATE_MOVABLE
    42 MIGRATE_UNRECLAIMABLE

The majority of the fallbacks are due to movable allocations and this is consistent for the workload throughout the series so will not be presented again as the primary source of fallbacks are movable allocations.

Movable fallbacks are sometimes considered "ok" to fallback because they can be migrated. The problem is that they can fill an unmovable/reclaimable pageblock causing those allocations to fallback later and polluting pageblocks with pages that cannot move. If there is a movable fallback, it is pretty much guaranteed to affect an unmovable/reclaimable pageblock and while it might not be enough to actually cause an unmovable/reclaimable fallback in the future, we cannot know that in advance so the patch takes the only option available to it. Hence, it's important to control them. This point is also consistent throughout the series and will not be repeated.

1-socket Skylake machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc3 extfrag events < order 9:   291392
4.20-rc3+patch:                      191187 (34% reduction)

thpfioscale Fault Latencies
                           4.20.0-rc3            4.20.0-rc3
                              vanilla          lowzone-v5r8
Amean fault-base-1   1495.14 (  0.00%)    1467.55 (  1.85%)
Amean fault-huge-1   1098.48 (  0.00%)    1127.11 ( -2.61%)

thpfioscale Percentage Faults Huge
                           4.20.0-rc3            4.20.0-rc3
                              vanilla          lowzone-v5r8
Percentage huge-1      78.57 (  0.00%)      77.64 ( -1.18%)

Fragmentation events were reduced quite a bit although this is known to be a little variable. The latencies and allocation success rates are similar but they were already quite high.

2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads
----------------------------------------------------------------

4.20-rc3 extfrag events < order 9:   215698
4.20-rc3+patch:                      200210 (7% reduction)

thpfioscale Fault Latencies
                           4.20.0-rc3            4.20.0-rc3
                              vanilla          lowzone-v5r8
Amean fault-base-5   1350.05 (  0.00%)    1346.45 (  0.27%)
Amean fault-huge-5   4181.01 (  0.00%)    3418.60 ( 18.24%)

                           4.20.0-rc3            4.20.0-rc3
                              vanilla          lowzone-v5r8
Percentage huge-5       1.15 (  0.00%)       0.78 (-31.88%)

The reduction of external fragmentation events is slight and this is partially due to the removal of __GFP_THISNODE in commit ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings") as THP allocations can now spill over to remote nodes instead of fragmenting local memory.

2-socket Haswell machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc3 extfrag events < order 9:   166352
4.20-rc3+patch:                      147463 (11% reduction)

thpfioscale Fault Latencies
                           4.20.0-rc3            4.20.0-rc3
                              vanilla          lowzone-v5r8
Amean fault-base-5   6138.97 (  0.00%)    6217.43 ( -1.28%)
Amean fault-huge-5   2294.28 (  0.00%)    3163.33 * -37.88%*

thpfioscale Percentage Faults Huge
                           4.20.0-rc3            4.20.0-rc3
                              vanilla          lowzone-v5r8
Percentage huge-5      96.82 (  0.00%)      95.14 ( -1.74%)

There was a slight reduction in external fragmentation events although the latencies were higher. The allocation success rate is high enough that the system is struggling and there is quite a lot of parallel reclaim and compaction activity. There is also a certain degree of luck on whether processes start on node 0 or not for this patch but the relevance is reduced later in the series.

Overall, the patch reduces the number of external fragmentation causing events so the success of THP over long periods of time would be improved for this adverse workload.

Link: http://lkml.kernel.org/r/20181123114528.28802-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[RealJohnGalt]: adapt for 4.14
Signed-off-by: azrim <mirzaspc@gmail.com> |
||
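The primary metric in the commit message above (fallbacks with an order below the pageblock order) and the quoted reduction percentages can be re-derived with a few lines of arithmetic. This is a sketch for checking the numbers only: the event counts are copied from the message, the default pageblock order of 9 is the 64-bit x86 value the message states, and the helper names `is_extfrag_event` and `reduction_percent` are hypothetical.

```python
def is_extfrag_event(fallback_order, pageblock_order=9):
    """A fallback counts as an external fragmentation event when its order
    is smaller than the pageblock order (order-9 on 64-bit x86)."""
    return fallback_order < pageblock_order


def reduction_percent(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100


# (vanilla, patched) counts of extfrag events below order 9, per the message.
results = {
    "1-socket Skylake, no special madvise": (804694, 408912),
    "1-socket Skylake, MADV_HUGEPAGE": (291392, 191187),
    "2-socket Haswell, no special madvise": (215698, 200210),
    "2-socket Haswell, MADV_HUGEPAGE": (166352, 147463),
}

for name, (vanilla, patched) in results.items():
    print(f"{name}: {reduction_percent(vanilla, patched):.1f}% reduction")
```

Rounded to whole percentages, the four results match the 49%, 34%, 7% and 11% figures quoted in the message.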
|
c9f283e9d0
|
ANDROID: mm: use alloc_flags for cma first alloc policy
rmqueue internal functions to allocate cma memory should use alloc_flags instead of gfp_flags because it's more restricted flags to be considered allocation context. Otherwise, we could allocate page from CMA area even though current context already disable cma memory allocation. For example, current allocation context is limited not to allocate the page from CMA area by PF_MEMALLOC_NOCMA to prevent longterm pin but it's ignored so the longterm pin page could be allocated from CMA area. Bug: 178019362 Signed-off-by: Minchan Kim <minchan@google.com> Change-Id: I61bd4642c91ecd9153f6c59f89e296e8b515f1ad Signed-off-by: azrim <mirzaspc@gmail.com> |
||
|
3f83025b10
|
mm/vmalloc: Fix unlock order in s_stop()
When multiple locks are acquired, they should be released in reverse order. For s_start() and s_stop() in mm/vmalloc.c, that is not the case. s_start: mutex_lock(&vmap_purge_lock); spin_lock(&vmap_area_lock); s_stop : mutex_unlock(&vmap_purge_lock); spin_unlock(&vmap_area_lock); This unlock sequence, though allowed, is not optimal. If a waiter is present, mutex_unlock() will need to go through the slowpath of waking up the waiter with preemption disabled. Fix that by releasing the spinlock first before the mutex. Link: https://lkml.kernel.org/r/20201213180843.16938-1-longman@redhat.com Fixes: e36176be1c39 ("mm/vmalloc: rework vmap_area_lock") Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: azrim <mirzaspc@gmail.com> |
||
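The ordering rule in the s_stop() fix (release locks in the reverse order of acquisition) can be illustrated in userspace. This is only an analogy, assuming two `threading.Lock` objects stand in for the kernel's mutex and spinlock; Python has no spinlock and no preemption control, so the sketch shows the ordering, not the performance effect the message describes.

```python
import threading

# Stand-ins for vmap_purge_lock (a mutex) and vmap_area_lock (a spinlock).
purge_lock = threading.Lock()
area_lock = threading.Lock()


def s_start():
    """Take the outer lock first, then the inner one."""
    purge_lock.acquire()
    area_lock.acquire()


def s_stop():
    """Release in reverse order: inner lock first, outer lock last."""
    area_lock.release()
    purge_lock.release()
```

With this ordering the outer lock is dropped only after the inner one is no longer held, which mirrors the fix of releasing the spinlock before the mutex.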
|
06a6d6c51e |
mm/mmap.c: fix missing call to vm_unacct_memory in mmap_region
[ Upstream commit 7f82f922319ede486540e8746769865b9508d2c2 ] Since the beginning, charged is set to 0 to avoid calling vm_unacct_memory twice because vm_unacct_memory will be called by above unmap_region. But since commit 4f74d2c8e827 ("vm: remove 'nr_accounted' calculations from the unmap_vmas() interfaces"), unmap_region doesn't call vm_unacct_memory anymore. So charged shouldn't be set to 0 now otherwise the calling to paired vm_unacct_memory will be missed and leads to imbalanced account. Link: https://lkml.kernel.org/r/20220618082027.43391-1-linmiaohe@huawei.com Fixes: 4f74d2c8e827 ("vm: remove 'nr_accounted' calculations from the unmap_vmas() interfaces") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
fd55aed621
|
ANDROID: mm: cma: retry allocations in cma_alloc
CMA allocations will fail if 'pinned' pages are in a CMA area, since we cannot migrate pinned pages. The _refcount of a struct page being greater than _mapcount for that page can cause pinning for anonymous pages. This is because try_to_unmap(), which (1) is called in the CMA allocation path, and (2) decrements both _refcount and _mapcount for a page, will stop unmapping a page from VMAs once the _mapcount for a page reaches 0. This implies that after try_to_unmap() has finished successfully for a page where _refcount > _mapcount, that _refcount will be greater than 0. Later in the CMA allocation path in migrate_page_move_mapping(), we will have one more reference count than intended for anonymous pages, meaning the allocation will fail for that page. One example of where _refcount can be greater than _mapcount for a page we would not expect to be pinned is inside of copy_one_pte(), which is called during a fork. For ptes for which pte_present(pte) == true, copy_one_pte() will increment the _refcount field followed by the _mapcount field of a page. If the process doing copy_one_pte() is context switched out after incrementing _refcount but before incrementing _mapcount, then the page will be temporarily pinned. So, inside of cma_alloc(), instead of giving up when alloc_contig_range() returns -EBUSY after having scanned a whole CMA-region bitmap, perform retries with sleeps to give the system an opportunity to unpin any pinned pages. Additionally, based off feedback by Minchan Kim, add the ability to exit early if a fatal signal is pending (this is a delta from the mailing-list version of this patch). 
Bug: 168521646 Link: https://lore.kernel.org/lkml/1596682582-29139-2-git-send-email-cgoldswo@codeaurora.org/ Signed-off-by: Chris Goldsworthy <cgoldswo@codeaurora.org> Co-developed-by: Susheel Khiani <skhiani@codeaurora.org> Signed-off-by: Susheel Khiani <skhiani@codeaurora.org> Co-developed-by: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org> Change-Id: I2f0c8388f9163e0decd631d9ae07bb6ad9ab79c8 Signed-off-by: azrim <mirzaspc@gmail.com> |
||
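The retry policy described in "ANDROID: mm: cma: retry allocations in cma_alloc" (retry on -EBUSY with a sleep between attempts, and stop early if a fatal signal is pending) can be sketched as a generic loop. Everything concrete here is an assumption made for illustration: the function name `alloc_with_retries`, the retry count, the delay, and the choice to return the last -EBUSY when giving up are not taken from the patch.

```python
import errno
import time


def alloc_with_retries(try_alloc, max_retries=5, delay_s=0.1,
                       fatal_signal_pending=lambda: False, sleep=time.sleep):
    """Call try_alloc() until it stops returning -EBUSY.

    try_alloc: callable returning 0 on success or a negative errno value.
    max_retries, delay_s: hypothetical tuning values, not from the patch.
    fatal_signal_pending: callable modelling the early-exit check.
    """
    ret = -errno.EBUSY
    for attempt in range(max_retries + 1):
        ret = try_alloc()
        if ret != -errno.EBUSY:
            return ret          # success, or an error that retrying won't fix
        if fatal_signal_pending():
            break               # the caller is being killed; stop retrying
        if attempt < max_retries:
            sleep(delay_s)      # give transiently pinned pages time to unpin
    return ret
```

Sleeping between attempts matters because the pinning the message describes is transient: a task preempted between the two counter increments only needs to be scheduled again for the page to become migratable.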
|
c5ae090ed2
|
Revert "mm: cma: sleep between retries in cma_alloc"
This reverts commit 8ffd0015dd9e44b597acd5c436d8de14de9de648. Signed-off-by: Kazuki Hashimoto <kazukih@tuta.io> Signed-off-by: azrim <mirzaspc@gmail.com> |
||
|
1639de41dc
|
Revert "mm: cma: Increase retries if less blocks available"
This reverts commit 5a883fb75e3aa713c4e59ca78ab9acd09b78890e. Signed-off-by: Kazuki Hashimoto <kazukih@tuta.io> Signed-off-by: azrim <mirzaspc@gmail.com> |
||
|
b885283394
|
Revert "mm: cma: retry only on EBUSY"
This reverts commit 54261b6432159003546e392c1f2a3fa94ed43240. Signed-off-by: Kazuki Hashimoto <kazukih@tuta.io> Signed-off-by: azrim <mirzaspc@gmail.com> |
||
|
b58f06a371
|
ANDROID: mm: do not try test_page_isoalte if migration fails
Currently, even when a page fails with -EBUSY in __alloc_contig_migrate_range, alloc_contig_range still checks those failed pages again in test_pages_isolated, in the hope that they will be freed soon so that the CMA allocation succeeds. However, this depends on luck, and I found that test_pages_isolated sometimes fails on the same page repeatedly whenever cma_alloc retries. Rather than burning CPU to check the page's status in test_pages_isolated for a __GFP_NORETRY allocation, just bail out and let the caller decide what to do. Currently, this option applies only to the __GFP_NORETRY case, for the safety of other existing users. Bug: 192475091 Signed-off-by: Minchan Kim <minchan@google.com> Change-Id: I9211452be06960dc7d8f854537e53b3fc5262c8e Signed-off-by: azrim <mirzaspc@gmail.com> |
||
|
b87443c8c0
|
FROMGIT: mm: remove lru_add_drain_all in alloc_contig_range
__alloc_contig_migrate_range already has an lru_add_drain_all call via migrate_prep. It's necessary to move LRU target pages into the LRU list so that they can be isolated. However, the lru_add_drain_all call after __alloc_contig_migrate_range is pointless since source page freeing was changed from putback_lru_pages to put_page[1]. This patch removes it. [1] c6c919eb90e0 ("mm: use put_page() to free page instead of putback_lru_page()") Link: https://lkml.kernel.org/r/20210303204512.2863087-1-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Minchan Kim <minchan@google.com> Bug: 181887812 (cherry picked from https://lore.kernel.org/mm-commits/20210303230112.sxLQLXDXF%25akpm@linux-foundation.org/) Change-Id: I53b6a7a0499d6580fb6febb7ae6ec12c3f871224 Signed-off-by: azrim <mirzaspc@gmail.com> |
||
|
6ac613581b
|
Merge remote-tracking branch 'google/android-4.14-stable' into richelieu
* google/android-4.14-stable:
  FROMLIST: binder: fix UAF of ref->proc caused by race condition
  UPSTREAM: x86/pci: Fix the function type for check_reserved_t
  Linux 4.14.290
  PCI: hv: Fix interrupt mapping for multi-MSI
  PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()
  PCI: hv: Fix hv_arch_irq_unmask() for multi-MSI
  PCI: hv: Fix multi-MSI to allow more than one MSI vector
  net: usb: ax88179_178a needs FLAG_SEND_ZLP
  tty: use new tty_insert_flip_string_and_push_buffer() in pty_write()
  tty: extract tty_flip_buffer_commit() from tty_flip_buffer_push()
  tty: drop tty_schedule_flip()
  tty: the rest, stop using tty_schedule_flip()
  tty: drivers/tty/, stop using tty_schedule_flip()
  Bluetooth: Fix bt_skb_sendmmsg not allocating partial chunks
  Bluetooth: SCO: Fix sco_send_frame returning skb->len
  Bluetooth: Fix passing NULL to PTR_ERR
  Bluetooth: RFCOMM: Replace use of memcpy_from_msg with bt_skb_sendmmsg
  Bluetooth: SCO: Replace use of memcpy_from_msg with bt_skb_sendmsg
  Bluetooth: Add bt_skb_sendmmsg helper
  Bluetooth: Add bt_skb_sendmsg helper
  ALSA: memalloc: Align buffer allocations in page size
  tilcdc: tilcdc_external: fix an incorrect NULL check on list iterator
  drm/tilcdc: Remove obsolete crtc_mode_valid() hack
  bpf: Make sure mac_header was set before using it
  mm/mempolicy: fix uninit-value in mpol_rebind_policy()
  Revert "Revert "char/random: silence a lockdep splat with printk()""
  be2net: Fix buffer overflow in be_get_module_eeprom
  tcp: Fix a data-race around sysctl_tcp_notsent_lowat.
  igmp: Fix a data-race around sysctl_igmp_max_memberships.
  igmp: Fix data-races around sysctl_igmp_llm_reports.
  net: stmmac: fix dma queue left shift overflow issue
  i2c: cadence: Change large transfer count reset logic to be unconditional
  tcp: Fix a data-race around sysctl_tcp_probe_interval.
  tcp: Fix a data-race around sysctl_tcp_probe_threshold.
  tcp/dccp: Fix a data-race around sysctl_tcp_fwmark_accept.
  ip: Fix a data-race around sysctl_fwmark_reflect.
  perf/core: Fix data race between perf_event_set_output() and perf_mmap_close()
  power/reset: arm-versatile: Fix refcount leak in versatile_reboot_probe
  xfrm: xfrm_policy: fix a possible double xfrm_pols_put() in xfrm_bundle_lookup()
  xen/gntdev: Ignore failure to unmap INVALID_GRANT_HANDLE |
||
|
3a03bd4b85
|
mm/swap.c: complete 5a9bd6bb520
Signed-off-by: azrim <mirzaspc@gmail.com> |
||
|
fbbd87b7f6
|
mm, page_alloc: double zone's batchsize
To improve page allocator's performance for order-0 pages, each CPU has a Per-CPU-Pageset(PCP) per zone. Whenever an order-0 page is needed, PCP will be checked first before asking pages from Buddy. When PCP is used up, a batch of pages will be fetched from Buddy to improve performance and the size of batch can affect performance. zone's batch size gets doubled last time by commit ba56e91c9401("mm: page_alloc: increase size of per-cpu-pages") over ten years ago. Since then, CPU has evolved a lot and CPU's cache sizes also increased. Dave Hansen is concerned the current batch size doesn't fit well with modern hardware and suggested that I do two things: first, use a page allocator intensive benchmark, e.g. will-it-scale/page_fault1 to find out how performance changes with different batch sizes on various machines and then choose a new default batch size; second, see how this new batch size works with other workloads. In the first test, we saw performance gains on high-core-count systems and little to no effect on older systems with more modest core counts. In this phase's test data, two candidates: 63 and 127 are chosen. In the second step, ebizzy, oltp, kbuild, pigz, netperf, vm-scalability and more will-it-scale sub-tests are tested to see how these two candidates work with these workloads and decide a new default according to their results. Most test results are flat. will-it-scale/page_fault2 process mode has 10%-18% performance increase on 4-sockets Skylake and Broadwell. vm-scalability/lru-file-mmap-read has 17%-47% performance increase for 4-sockets servers while for 2-sockets servers, it caused 3%-8% performance drop. Further analysis showed that, with a larger pcp->batch and thus larger pcp->high(the relationship of pcp->high=6 * pcp->batch is maintained in this patch), zone lock contention shifted to LRU add side lock contention and that caused performance drop. This performance drop might be mitigated by others' work on optimizing LRU lock. 
Another downside of increasing pcp->batch is, when PCP is used up and need to fetch a batch of pages from Buddy, since batch is increased, that time can be longer than before. My understanding is, this doesn't affect slowpath where direct reclaim and compaction dominates. For fastpath, throughput is a win(according to will-it-scale/page_fault1) but worst latency can be larger now. Overall, I think double the batch size from 31 to 63 is relatively safe and provide good performance boost for high-core-count systems. The two phase's test results are listed below(all tests are done with THP disabled). Phase one(will-it-scale/page_fault1) test results: Skylake-EX: increased batch size has a good effect on zone->lock contention, though LRU contention will rise at the same time and limited the final performance increase. batch score change zone_contention lru_contention total_contention 31 15345900 +0.00% 64% 8% 72% 53 17903847 +16.67% 32% 38% 70% 63 17992886 +17.25% 24% 45% 69% 73 18022825 +17.44% 10% 61% 71% 119 18023401 +17.45% 4% 66% 70% 127 18029012 +17.48% 3% 66% 69% 137 18036075 +17.53% 4% 66% 70% 165 18035964 +17.53% 2% 67% 69% 188 18101105 +17.95% 2% 67% 69% 223 18130951 +18.15% 2% 67% 69% 255 18118898 +18.07% 2% 67% 69% 267 18101559 +17.96% 2% 67% 69% 299 18160468 +18.34% 2% 68% 70% 320 18139845 +18.21% 2% 67% 69% 393 18160869 +18.34% 2% 68% 70% 424 18170999 +18.41% 2% 68% 70% 458 18144868 +18.24% 2% 68% 70% 467 18142366 +18.22% 2% 68% 70% 498 18154549 +18.30% 1% 68% 69% 511 18134525 +18.17% 1% 69% 70% Broadwell-EX: similar pattern as Skylake-EX. 
batch score change zone_contention lru_contention total_contention 31 16703983 +0.00% 67% 7% 74% 53 18195393 +8.93% 43% 28% 71% 63 18288885 +9.49% 38% 33% 71% 73 18344329 +9.82% 35% 37% 72% 119 18535529 +10.96% 24% 46% 70% 127 18513596 +10.83% 23% 48% 71% 137 18514327 +10.84% 23% 48% 71% 165 18511840 +10.82% 22% 49% 71% 188 18593478 +11.31% 17% 53% 70% 223 18601667 +11.36% 17% 52% 69% 255 18774825 +12.40% 12% 58% 70% 267 18754781 +12.28% 9% 60% 69% 299 18892265 +13.10% 7% 63% 70% 320 18873812 +12.99% 8% 62% 70% 393 18891174 +13.09% 6% 64% 70% 424 18975108 +13.60% 6% 64% 70% 458 18932364 +13.34% 8% 62% 70% 467 18960891 +13.51% 5% 65% 70% 498 18944526 +13.41% 5% 64% 69% 511 18960839 +13.51% 5% 64% 69% Skylake-EP: although increased batch reduced zone->lock contention, but the effect is not as good as EX: zone->lock contention is still as high as 20% with a very high batch value instead of 1% on Skylake-EX or 5% on Broadwell-EX. Also, total_contention actually decreased with a higher batch but that doesn't translate to performance increase. batch score change zone_contention lru_contention total_contention 31 9554867 +0.00% 66% 3% 69% 53 9855486 +3.15% 63% 3% 66% 63 9980145 +4.45% 62% 4% 66% 73 10092774 +5.63% 62% 5% 67% 119 10310061 +7.90% 45% 19% 64% 127 10342019 +8.24% 42% 19% 61% 137 10358182 +8.41% 42% 21% 63% 165 10397060 +8.81% 37% 24% 61% 188 10341808 +8.24% 34% 26% 60% 223 10349135 +8.31% 31% 27% 58% 255 10327189 +8.08% 28% 29% 57% 267 10344204 +8.26% 27% 29% 56% 299 10325043 +8.06% 25% 30% 55% 320 10310325 +7.91% 25% 31% 56% 393 10293274 +7.73% 21% 31% 52% 424 10311099 +7.91% 21% 32% 53% 458 10321375 +8.02% 21% 32% 53% 467 10303881 +7.84% 21% 32% 53% 498 10332462 +8.14% 20% 33% 53% 511 10325016 +8.06% 20% 32% 52% Broadwell-EP: zone->lock and lru lock had an agreement to make sure performance doesn't increase and they successfully managed to keep total contention at 70%. 
batch score change zone_contention lru_contention total_contention 31 10121178 +0.00% 19% 50% 69% 53 10142366 +0.21% 6% 63% 69% 63 10117984 -0.03% 11% 58% 69% 73 10123330 +0.02% 7% 63% 70% 119 10108791 -0.12% 2% 67% 69% 127 10166074 +0.44% 3% 66% 69% 137 10141574 +0.20% 3% 66% 69% 165 10154499 +0.33% 2% 68% 70% 188 10124921 +0.04% 2% 67% 69% 223 10137399 +0.16% 2% 67% 69% 255 10143289 +0.22% 0% 68% 68% 267 10123535 +0.02% 1% 68% 69% 299 10140952 +0.20% 0% 68% 68% 320 10163170 +0.41% 0% 68% 68% 393 10000633 -1.19% 0% 69% 69% 424 10087998 -0.33% 0% 69% 69% 458 10187116 +0.65% 0% 69% 69% 467 10146790 +0.25% 0% 69% 69% 498 10197958 +0.76% 0% 69% 69% 511 10152326 +0.31% 0% 69% 69% Haswell-EP: similar to Broadwell-EP. batch score change zone_contention lru_contention total_contention 31 10442205 +0.00% 14% 48% 62% 53 10442255 +0.00% 5% 57% 62% 63 10452059 +0.09% 6% 57% 63% 73 10482349 +0.38% 5% 59% 64% 119 10454644 +0.12% 3% 60% 63% 127 10431514 -0.10% 3% 59% 62% 137 10423785 -0.18% 3% 60% 63% 165 10481216 +0.37% 2% 61% 63% 188 10448755 +0.06% 2% 61% 63% 223 10467144 +0.24% 2% 61% 63% 255 10480215 +0.36% 2% 61% 63% 267 10484279 +0.40% 2% 61% 63% 299 10466450 +0.23% 2% 61% 63% 320 10452578 +0.10% 2% 61% 63% 393 10499678 +0.55% 1% 62% 63% 424 10481454 +0.38% 1% 62% 63% 458 10473562 +0.30% 1% 62% 63% 467 10484269 +0.40% 0% 62% 62% 498 10505599 +0.61% 0% 62% 62% 511 10483395 +0.39% 0% 62% 62% Westmere-EP: contention is pretty small so not interesting. Note too high a batch value could hurt performance. 
batch score change zone_contention lru_contention total_contention 31 4831523 +0.00% 2% 3% 5% 53 4834086 +0.05% 2% 4% 6% 63 4834262 +0.06% 2% 3% 5% 73 4832851 +0.03% 2% 4% 6% 119 4830534 -0.02% 1% 3% 4% 127 4827461 -0.08% 1% 4% 5% 137 4827459 -0.08% 1% 3% 4% 165 4820534 -0.23% 0% 4% 4% 188 4817947 -0.28% 0% 3% 3% 223 4809671 -0.45% 0% 3% 3% 255 4802463 -0.60% 0% 4% 4% 267 4801634 -0.62% 0% 3% 3% 299 4798047 -0.69% 0% 3% 3% 320 4793084 -0.80% 0% 3% 3% 393 4785877 -0.94% 0% 3% 3% 424 4782911 -1.01% 0% 3% 3% 458 4779346 -1.08% 0% 3% 3% 467 4780306 -1.06% 0% 3% 3% 498 4780589 -1.05% 0% 3% 3% 511 4773724 -1.20% 0% 3% 3% Skylake-Desktop: similar to Westmere-EP, nothing interesting. batch score change zone_contention lru_contention total_contention 31 3906608 +0.00% 2% 3% 5% 53 3940164 +0.86% 2% 3% 5% 63 3937289 +0.79% 2% 3% 5% 73 3940201 +0.86% 2% 3% 5% 119 3933240 +0.68% 2% 3% 5% 127 3930514 +0.61% 2% 4% 6% 137 3938639 +0.82% 0% 3% 3% 165 3908755 +0.05% 0% 3% 3% 188 3905621 -0.03% 0% 3% 3% 223 3903015 -0.09% 0% 4% 4% 255 3889480 -0.44% 0% 3% 3% 267 3891669 -0.38% 0% 4% 4% 299 3898728 -0.20% 0% 4% 4% 320 3894547 -0.31% 0% 4% 4% 393 3875137 -0.81% 0% 4% 4% 424 3874521 -0.82% 0% 3% 3% 458 3880432 -0.67% 0% 4% 4% 467 3888715 -0.46% 0% 3% 3% 498 3888633 -0.46% 0% 4% 4% 511 3875305 -0.80% 0% 5% 5% Haswell-Desktop: zone->lock is pretty low as other desktops, though lru contention is higher than other desktops. 
batch score change zone_contention lru_contention total_contention 31 3511158 +0.00% 2% 5% 7% 53 3555445 +1.26% 2% 6% 8% 63 3561082 +1.42% 2% 6% 8% 73 3547218 +1.03% 2% 6% 8% 119 3571319 +1.71% 1% 7% 8% 127 3549375 +1.09% 0% 6% 6% 137 3560233 +1.40% 0% 6% 6% 165 3555176 +1.25% 2% 6% 8% 188 3551501 +1.15% 0% 8% 8% 223 3531462 +0.58% 0% 7% 7% 255 3570400 +1.69% 0% 7% 7% 267 3532235 +0.60% 1% 8% 9% 299 3562326 +1.46% 0% 6% 6% 320 3553569 +1.21% 0% 8% 8% 393 3539519 +0.81% 0% 7% 7% 424 3549271 +1.09% 0% 8% 8% 458 3528885 +0.50% 0% 8% 8% 467 3526554 +0.44% 0% 7% 7% 498 3525302 +0.40% 0% 9% 9% 511 3527556 +0.47% 0% 8% 8% Sandybridge-Desktop: the 0% contention isn't accurate but caused by dropped fractional part. Since multiple contention path's contentions are all under 1% here, with some arithmetic operations like add, the final deviation could be as large as 3%. batch score change zone_contention lru_contention total_contention 31 1744495 +0.00% 0% 0% 0% 53 1755341 +0.62% 0% 0% 0% 63 1758469 +0.80% 0% 0% 0% 73 1759626 +0.87% 0% 0% 0% 119 1770417 +1.49% 0% 0% 0% 127 1768252 +1.36% 0% 0% 0% 137 1767848 +1.34% 0% 0% 0% 165 1765088 +1.18% 0% 0% 0% 188 1766918 +1.29% 0% 0% 0% 223 1767866 +1.34% 0% 0% 0% 255 1768074 +1.35% 0% 0% 0% 267 1763187 +1.07% 0% 0% 0% 299 1765620 +1.21% 0% 0% 0% 320 1767603 +1.32% 0% 0% 0% 393 1764612 +1.15% 0% 0% 0% 424 1758476 +0.80% 0% 0% 0% 458 1758593 +0.81% 0% 0% 0% 467 1757915 +0.77% 0% 0% 0% 498 1753363 +0.51% 0% 0% 0% 511 1755548 +0.63% 0% 0% 0% Phase two test results: Note: all percent change is against base(batch=31). 
ebizzy.throughput (higer is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 2410037±7% 2600451±2% +7.9% 2602878 +8.0% lkp-bdw-ex1 1493328 1489243 -0.3% 1492145 -0.1% lkp-skl-2sp2 1329674 1345891 +1.2% 1351056 +1.6% lkp-bdw-ep2 711511 711511 0.0% 710708 -0.1% lkp-wsm-ep2 75750 75528 -0.3% 75441 -0.4% lkp-skl-d01 264126 262791 -0.5% 264113 +0.0% lkp-hsw-d01 176601 176328 -0.2% 176368 -0.1% lkp-sb02 98937 98937 +0.0% 99030 +0.1% kbuild.buildtime (less is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 107.00 107.67 +0.6% 107.11 +0.1% lkp-bdw-ex1 97.33 97.33 +0.0% 97.42 +0.1% lkp-skl-2sp2 180.00 179.83 -0.1% 179.83 -0.1% lkp-bdw-ep2 178.17 179.17 +0.6% 177.50 -0.4% lkp-wsm-ep2 737.00 738.00 +0.1% 738.00 +0.1% lkp-skl-d01 642.00 653.00 +1.7% 653.00 +1.7% lkp-hsw-d01 1310.00 1316.00 +0.5% 1311.00 +0.1% netperf/TCP_STREAM.Throughput_total_Mbps (higher is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 948790 947144 -0.2% 948333 -0.0% lkp-bdw-ex1 904224 904366 +0.0% 904926 +0.1% lkp-skl-2sp2 239731 239607 -0.1% 239565 -0.1% lk-bdw-ep2 365764 365933 +0.0% 365951 +0.1% lkp-wsm-ep2 93736 93803 +0.1% 93808 +0.1% lkp-skl-d01 77314 77303 -0.0% 77375 +0.1% lkp-hsw-d01 58617 60387 +3.0% 60208 +2.7% lkp-sb02 29990 30137 +0.5% 30103 +0.4% oltp.transactions (higer is better) machine batch=31 batch=63 batch=127 lkp-bdw-ex1 9073276 9100377 +0.3% 9036344 -0.4% lkp-skl-2sp2 8898717 8852054 -0.5% 8894459 -0.0% lkp-bdw-ep2 13426155 13384654 -0.3% 13333637 -0.7% lkp-hsw-ep2 13146314 13232784 +0.7% 13193163 +0.4% lkp-wsm-ep2 5035355 5019348 -0.3% 5033418 -0.0% lkp-skl-d01 418485 4413339 -0.1% 4419039 +0.0% lkp-hsw-d01 3517817±5% 3396120±3% -3.5% 3455138±3% -1.8% pigz.throughput (higer is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 1.513e+08 1.507e+08 -0.4% 1.511e+08 -0.2% lkp-bdw-ex1 2.060e+08 2.052e+08 -0.4% 2.044e+08 -0.8% lkp-skl-2sp2 8.836e+08 8.845e+08 +0.1% 8.836e+08 -0.0% lkp-bdw-ep2 8.275e+08 8.464e+08 +2.3% 8.330e+08 +0.7% lkp-wsm-ep2 
2.224e+08 2.221e+08 -0.2% 2.218e+08 -0.3% lkp-skl-d01 1.177e+08 1.177e+08 -0.0% 1.176e+08 -0.1% lkp-hsw-d01 1.154e+08 1.154e+08 +0.1% 1.154e+08 -0.0% lkp-sb02 0.633e+08 0.633e+08 +0.1% 0.633e+08 +0.0% will-it-scale.malloc1.processes (higher is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 620181 620484 +0.0% 620240 +0.0% lkp-bdw-ex1 1403610 1401201 -0.2% 1417900 +1.0% lkp-skl-2sp2 1288097 1284145 -0.3% 1283907 -0.3% lkp-bdw-ep2 1427879 1427675 -0.0% 1428266 +0.0% lkp-hsw-ep2 1362546 1353965 -0.6% 1354759 -0.6% lkp-wsm-ep2 2099657 2107576 +0.4% 2100226 +0.0% lkp-skl-d01 1476835 1476358 -0.0% 1474487 -0.2% lkp-hsw-d01 1308810 1303429 -0.4% 1301299 -0.6% lkp-sb02 589286 589284 -0.0% 588101 -0.2% will-it-scale.malloc1.threads (higher is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 21289 21125 -0.8% 21241 -0.2% lkp-bdw-ex1 28114 28089 -0.1% 28007 -0.4% lkp-skl-2sp2 91866 91946 +0.1% 92723 +0.9% lkp-bdw-ep2 37637 37501 -0.4% 37317 -0.9% lkp-hsw-ep2 43673 43590 -0.2% 43754 +0.2% lkp-wsm-ep2 28577 28298 -1.0% 28545 -0.1% lkp-skl-d01 175277 173343 -1.1% 173082 -1.3% lkp-hsw-d01 130303 129566 -0.6% 129250 -0.8% lkp-sb02 113742±3% 116911 +2.8% 116417±3% +2.4% will-it-scale.malloc2.processes (higer is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 1.206e+09 1.206e+09 -0.0% 1.206e+09 +0.0% lkp-bdw-ex1 1.319e+09 1.319e+09 -0.0% 1.319e+09 +0.0% lkp-skl-2sp2 8.000e+08 8.021e+08 +0.3% 7.995e+08 -0.1% lkp-bdw-ep2 6.582e+08 6.634e+08 +0.8% 6.513e+08 -1.1% lkp-hsw-ep2 6.671e+08 6.669e+08 -0.0% 6.665e+08 -0.1% lkp-wsm-ep2 1.805e+08 1.806e+08 +0.0% 1.804e+08 -0.1% lkp-skl-d01 1.611e+08 1.611e+08 -0.0% 1.610e+08 -0.0% lkp-hsw-d01 1.333e+08 1.332e+08 -0.0% 1.332e+08 -0.0% lkp-sb02 82485104 82478206 -0.0% 82473546 -0.0% will-it-scale.malloc2.threads (higer is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 1.574e+09 1.574e+09 -0.0% 1.574e+09 -0.0% lkp-bdw-ex1 1.737e+09 1.737e+09 +0.0% 1.737e+09 -0.0% lkp-skl-2sp2 9.161e+08 9.162e+08 +0.0% 9.181e+08 
+0.2% lkp-bdw-ep2 7.856e+08 8.015e+08 +2.0% 8.113e+08 +3.3% lkp-hsw-ep2 6.908e+08 6.904e+08 -0.1% 6.907e+08 -0.0% lkp-wsm-ep2 2.409e+08 2.409e+08 +0.0% 2.409e+08 -0.0% lkp-skl-d01 1.199e+08 1.199e+08 -0.0% 1.199e+08 -0.0% lkp-hsw-d01 1.029e+08 1.029e+08 -0.0% 1.029e+08 +0.0% lkp-sb02 68081213 68061423 -0.0% 68076037 -0.0% will-it-scale.page_fault2.processes (higer is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 14509125±4% 16472364 +13.5% 17123117 +18.0% lkp-bdw-ex1 14736381 16196588 +9.9% 16364011 +11.0% lkp-skl-2sp2 6354925 6435444 +1.3% 6436644 +1.3% lkp-bdw-ep2 8749584 8834422 +1.0% 8827179 +0.9% lkp-hsw-ep2 8762591 8845920 +1.0% 8825697 +0.7% lkp-wsm-ep2 3036083 3030428 -0.2% 3021741 -0.5% lkp-skl-d01 2307834 2304731 -0.1% 2286142 -0.9% lkp-hsw-d01 1806237 1800786 -0.3% 1795943 -0.6% lkp-sb02 842616 837844 -0.6% 833921 -1.0% will-it-scale.page_fault2.threads machine batch=31 batch=63 batch=127 lkp-skl-4sp1 1623294 1615132±2% -0.5% 1656777 +2.1% lkp-bdw-ex1 1995714 2025948 +1.5% 2113753±3% +5.9% lkp-skl-2sp2 2346708 2415591 +2.9% 2416919 +3.0% lkp-bdw-ep2 2342564 2344882 +0.1% 2300206 -1.8% lkp-hsw-ep2 1820658 1831681 +0.6% 1844057 +1.3% lkp-wsm-ep2 1725482 1733774 +0.5% 1740517 +0.9% lkp-skl-d01 1832833 1823628 -0.5% 1806489 -1.4% lkp-hsw-d01 1427913 1427287 -0.0% 1420226 -0.5% lkp-sb02 750626 748615 -0.3% 746621 -0.5% will-it-scale.page_fault3.processes (higher is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 24382726 24400317 +0.1% 24668774 +1.2% lkp-bdw-ex1 35399750 35683124 +0.8% 35829492 +1.2% lkp-skl-2sp2 28136820 28068248 -0.2% 28147989 +0.0% lkp-bdw-ep2 37269077 37459490 +0.5% 37373073 +0.3% lkp-hsw-ep2 36224967 36114085 -0.3% 36104908 -0.3% lkp-wsm-ep2 16820457 16911005 +0.5% 16968596 +0.9% lkp-skl-d01 7721138 7725904 +0.1% 7756740 +0.5% lkp-hsw-d01 7611979 7650928 +0.5% 7651323 +0.5% lkp-sb02 3781546 3796502 +0.4% 3796827 +0.4% will-it-scale.page_fault3.threads (higer is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 
1865820±3% 1900917±2% +1.9% 1826245±4% -2.1% lkp-bdw-ex1 3094060 3148326 +1.8% 3150036 +1.8% lkp-skl-2sp2 3952940 3953898 +0.0% 3989360 +0.9% lkp-bdw-ep2 3420373±3% 3643964 +6.5% 3644910±5% +6.6% lkp-hsw-ep2 2609635±2% 2582310±3% -1.0% 2780459 +6.5% lkp-wsm-ep2 4395001 4417196 +0.5% 4432499 +0.9% lkp-skl-d01 5363977 5400003 +0.7% 5411370 +0.9% lkp-hsw-d01 5274131 5311294 +0.7% 5319359 +0.9% lkp-sb02 2917314 2913004 -0.1% 2935286 +0.6% will-it-scale.read1.processes (higer is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 73762279±14% 69322519±10% -6.0% 69349855±13% -6.0% (result unstable) lkp-bdw-ex1 1.701e+08 1.704e+08 +0.1% 1.705e+08 +0.2% lkp-skl-2sp2 63111570 63113953 +0.0% 63836573 +1.1% lkp-bdw-ep2 79247409 79424610 +0.2% 78012656 -1.6% lkp-hsw-ep2 67677026 68308800 +0.9% 67539106 -0.2% lkp-wsm-ep2 13339630 13939817 +4.5% 13766865 +3.2% lkp-skl-d01 10969487 10972650 +0.0% no data lkp-hsw-d01 9857342±2% 10080592±2% +2.3% 10131560 +2.8% lkp-sb02 5189076 5197473 +0.2% 5163253 -0.5% will-it-scale.read1.threads (higher is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 62468045±12% 73666726±7% +17.9% 79553123±12% +27.4% (result unstable) lkp-bdw-ex1 1.62e+08 1.624e+08 +0.3% 1.614e+08 -0.3% lkp-skl-2sp2 58319780 59181032 +1.5% 59821353 +2.6% lkp-bdw-ep2 74057992 75698171 +2.2% 74990869 +1.3% lkp-hsw-ep2 63672959 63639652 -0.1% 64387051 +1.1% lkp-wsm-ep2 13489943 13526058 +0.3% 13259032 -1.7% lkp-skl-d01 10297906 10338796 +0.4% 10407328 +1.1% lkp-hsw-d01 9636721 9667376 +0.3% 9341147 -3.1% lkp-sb02 4801938 4804496 +0.1% 4802290 +0.0% will-it-scale.write1.processes (higer is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 1.111e+08 1.104e+08±2% -0.7% 1.122e+08±2% +1.0% lkp-bdw-ex1 1.392e+08 1.399e+08 +0.5% 1.397e+08 +0.4% lkp-skl-2sp2 59369233 58994841 -0.6% 58715168 -1.1% lkp-bdw-ep2 61820979 CPU throttle 63593123 +2.9% lkp-hsw-ep2 57897587 57435605 -0.8% 56347450 -2.7% lkp-wsm-ep2 7814203 7918017±2% +1.3% 7669068 -1.9% lkp-skl-d01 8886557 
8971422 +1.0% 8818366 -0.8% lkp-hsw-d01 9171001±5% 9189915 +0.2% 9483909 +3.4% lkp-sb02 4475406 4475294 -0.0% 4501756 +0.6% will-it-scale.write1.threads (higer is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 1.058e+08 1.055e+08±2% -0.2% 1.065e+08 +0.7% lkp-bdw-ex1 1.316e+08 1.300e+08 -1.2% 1.308e+08 -0.6% lkp-skl-2sp2 54492421 56086678 +2.9% 55975657 +2.7% lkp-bdw-ep2 59360449 59003957 -0.6% 58101262 -2.1% lkp-hsw-ep2 53346346±2% 52530876 -1.5% 52902487 -0.8% lkp-wsm-ep2 7774006 7800092±2% +0.3% 7558833 -2.8% lkp-skl-d01 8346174 8235695 -1.3% no data lkp-hsw-d01 8636244 8655731 +0.2% 8658868 +0.3% lkp-sb02 4181820 4204107 +0.5% 4182992 +0.0% vm-scalability.anon-r-rand.throughput (higher is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 11933873±3% 12356544±2% +3.5% 12188624 +2.1% lkp-bdw-ex1 7114424±2% 7330949±2% +3.0% 7392419 +3.9% lkp-skl-2sp2 6773277±5% 6492332±8% -4.1% 6543962 -3.4% lkp-bdw-ep2 7133846±4% 7233508 +1.4% 7013518±3% -1.7% lkp-hsw-ep2 4576626 4527098 -1.1% 4551679 -0.5% lkp-wsm-ep2 2583599 2592492 +0.3% 2588039 +0.2% lkp-hsw-d01 998199±2% 1028311 +3.0% 1006460±2% +0.8% lkp-sb02 570572 567854 -0.5% 568449 -0.4% vm-scalability.anon-r-rand-mt.throughput (higher is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 1789419 1787830 -0.1% 1788208 -0.1% lkp-bdw-ex1 3492595±2% 3554966±2% +1.8% 3558835±3% +1.9% lkp-skl-2sp2 3856238±2% 3975403±4% +3.1% 3994600 +3.6% lkp-bdw-ep2 3726963±11% 3809292±6% +2.2% 3871924±4% +3.9% lkp-hsw-ep2 2131760±3% 2033578±4% -4.6% 2130727±6% -0.0% lkp-wsm-ep2 2369731 2368384 -0.1% 2370252 +0.0% lkp-skl-d01 1207128 1206220 -0.1% 1205801 -0.1% lkp-hsw-d01 964317 992329±2% +2.9% 992099±2% +2.9% lkp-sb02 567137 567346 +0.0% 566144 -0.2% vm-scalability.lru-file-mmap-read.throughput (higher is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 19560469±6% 23018999 +17.7% 23418800 +19.7% lkp-bdw-ex1 17769135±14% 26141676±3% +47.1% 26284723±5% +47.9% lkp-skl-2sp2 14056512 13578884 -3.4% 13146214 -6.5% 
lkp-bdw-ep2 15336542 14737654 -3.9% 14088159 -8.1% lkp-hsw-ep2 16275498 15756296 -3.2% 15018090 -7.7% lkp-wsm-ep2 11272160 11237231 -0.3% 11310047 +0.3% lkp-skl-d01 7322119 7324569 +0.0% 7184148 -1.9% lkp-hsw-d01 6449234 6404542 -0.7% 6356141 -1.4% lkp-sb02 3517943 3520668 +0.1% 3527309 +0.3% vm-scalability.lru-file-mmap-read-rand.throughput (higher is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 1689052 1697553 +0.5% 1698726 +0.6% lkp-bdw-ex1 1675246 1699764 +1.5% 1712226 +2.2% lkp-skl-2sp2 1800533 1799749 -0.0% 1800581 +0.0% lkp-bdw-ep2 1807422 1807758 +0.0% 1804932 -0.1% lkp-hsw-ep2 1809807 1808781 -0.1% 1807811 -0.1% lkp-wsm-ep2 1800198 1802434 +0.1% 1801236 +0.1% lkp-skl-d01 696689 695537 -0.2% 694106 -0.4% lkp-hsw-d01 698364 698666 +0.0% 696686 -0.2% lkp-sb02 258939 258787 -0.1% 258199 -0.3% Link: http://lkml.kernel.org/r/20180711055855.29072-1-aaron.lu@intel.com Signed-off-by: Aaron Lu <aaron.lu@intel.com> Suggested-by: Dave Hansen <dave.hansen@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Kemi Wang <kemi.wang@intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Danny Lin <danny@kdrag0n.dev> Signed-off-by: azrim <mirzaspc@gmail.com> |
||
|
e13c88efb8
|
mm: introduce deactivate_page
Per-process reclaim needs to deactivate file pages from the active LRU on echo file > /proc/<pid>/reclaim. Add deactivate_file pages. Bug: 131016077 Bug: 153444106 (cherry picked from b07eab27085611203ad359b7f4eecd138d7d771a) Change-Id: I06fed20103671e4ca6fb8663d5029736442162a5 Signed-off-by: Minchan Kim <minchan@google.com> Signed-off-by: Martin Liu <liumartin@google.com> Signed-off-by: Andrzej Perczak <linux@andrzejperczak.com> Signed-off-by: azrim <mirzaspc@gmail.com> |
||
|
407183bd19
|
mm, compaction: raise compaction priority after it withdrawns
Mike Kravetz reports that "hugetlb allocations could stall for minutes or hours when should_compact_retry() would return true more often than it should. Specifically, this was in the case where compact_result was COMPACT_DEFERRED and COMPACT_PARTIAL_SKIPPED and no progress was being made." The problem is that the compaction_withdrawn() test in should_compact_retry() includes compaction outcomes that are only possible on low compaction priority, and results in a retry without increasing the priority. This may result in further reclaim, and more incomplete compaction attempts. With this patch, compaction priority is raised when possible, or should_compact_retry() returns false. The COMPACT_SKIPPED result doesn't really fit together with the other outcomes in compaction_withdrawn(), as that's a result caused by insufficient order-0 pages, not by low compaction priority. With this patch, it is moved to a new compaction_needs_reclaim() function, and for that outcome we keep the current logic of retrying if it looks like reclaim will be able to help. Bug: 156785617 Link: http://lkml.kernel.org/r/20190806014744.15446-4-mike.kravetz@oracle.com Reported-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Tested-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Minchan Kim <minchan@google.com> Change-Id: I67134003597caa963d5ecff7e2a42ef101e3aa4a Signed-off-by: Andrzej Perczak <linux@andrzejperczak.com> Signed-off-by: azrim <mirzaspc@gmail.com> |
||
|
1e550e19b2
|
mm: damon: Optimize reclaim parameters
Signed-off-by: azrim <mirzaspc@gmail.com> |
||
|
6cad89a148
|
mm: damon: Enable DAMON-based reclaim by default
Change-Id: I23da7fed2549c5d8ba9b62874056e74a201186b3 Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com> Signed-off-by: azrim <mirzaspc@gmail.com> |
||
|
1923f50f3d
|
mm/damon: reclaim: Adapt to downstream walk_system_ram_res()
Signed-off-by: Dark-Matter7232 <kerneldeveloper7232@gmail.com> Signed-off-by: azrim <mirzaspc@gmail.com> |
||
|
0f028d37e8
|
mm/damon: fix build
Signed-off-by: Dark-Matter7232 <kerneldeveloper7232@gmail.com> Signed-off-by: azrim <mirzaspc@gmail.com> |
||
|
171535f8a3
|
mm: damon: Avoid using mm_walk_ops API
- 4.19 kernel doesn't support mm_walk_ops structure so revert back to inlining mm_walk structures inside methods Change-Id: I87986a9e95468b76e4c68cb6b368599cc1494549 Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com> Signed-off-by: azrim <mirzaspc@gmail.com> |
||
|
68823c43f1
|
mm: madvise: Avoid using mm_walk_ops API
- 4.19 kernel doesn't support mm_walk_ops structure so revert back to inlining mm_walk structures inside methods Change-Id: Id9061bde347d5c4050025c7a8fce673677bfa9f2 Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com> Signed-off-by: azrim <mirzaspc@gmail.com> |
||
|
88dc2cfd55
|
UPSTREAM: mm: fix trying to reclaim unevictable lru page when calling madvise_pageout
Recently, I hit the following issue when running upstream. kernel BUG at mm/vmscan.c:1521! invalid opcode: 0000 [#1] SMP KASAN PTI CPU: 0 PID: 23385 Comm: syz-executor.6 Not tainted 5.4.0-rc4+ #1 RIP: 0010:shrink_page_list+0x12b6/0x3530 mm/vmscan.c:1521 Call Trace: reclaim_pages+0x499/0x800 mm/vmscan.c:2188 madvise_cold_or_pageout_pte_range+0x58a/0x710 mm/madvise.c:453 walk_pmd_range mm/pagewalk.c:53 [inline] walk_pud_range mm/pagewalk.c:112 [inline] walk_p4d_range mm/pagewalk.c:139 [inline] walk_pgd_range mm/pagewalk.c:166 [inline] __walk_page_range+0x45a/0xc20 mm/pagewalk.c:261 walk_page_range+0x179/0x310 mm/pagewalk.c:349 madvise_pageout_page_range mm/madvise.c:506 [inline] madvise_pageout+0x1f0/0x330 mm/madvise.c:542 madvise_vma mm/madvise.c:931 [inline] __do_sys_madvise+0x7d2/0x1600 mm/madvise.c:1113 do_syscall_64+0x9f/0x4c0 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe madvise_pageout() walks the specified range of the vma, isolates its pages, then runs shrink_page_list() to reclaim their memory. But it also isolates unevictable pages for reclaim, which trips the check in shrink_page_list(). The root cause is that we scan the page tables instead of a specific LRU list, so we need to filter out the unevictable LRU pages on our end. Link: http://lkml.kernel.org/r/1572616245-18946-1-git-send-email-zhongjiang@huawei.com Fixes: 1a4e58cce84e ("mm: introduce MADV_PAGEOUT") Signed-off-by: zhong jiang <zhongjiang@huawei.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com> Change-Id: I84ecf1ed9f7d9de14df1798922183fb637d75adc Signed-off-by: azrim <mirzaspc@gmail.com> |
||
|
2be36de245
|
mm/madvise: reduce code duplication in error handling paths
madvise_behavior() converts -ENOMEM to -EAGAIN in several places using identical code. Move that code to a common error handling path. No functional changes. Link: http://lkml.kernel.org/r/1564640896-1210-1-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Pankaj Gupta <pagupta@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: azrim <mirzaspc@gmail.com> |
||
|
df3a67460e
|
FROMLIST: mm/madvise: replace ptrace attach requirement for process_madvise
process_madvise currently requires ptrace attach capability. PTRACE_MODE_ATTACH gives one process complete control over another process. It effectively removes the security boundary between the two processes (in one direction). Granting ptrace attach capability even to a system process is considered dangerous since it creates an attack surface. This severely limits the usage of this API. The operations process_madvise can perform do not affect the correctness of the operation of the target process; they only affect where the data is physically located (and therefore, how fast it can be accessed). What we want is the ability for one process to influence another process in order to optimize performance across the entire system while leaving the security boundary intact. Replace PTRACE_MODE_ATTACH with a combination of PTRACE_MODE_READ and CAP_SYS_NICE. PTRACE_MODE_READ to prevent leaking ASLR metadata and CAP_SYS_NICE for influencing process performance. Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Link: https://lore.kernel.org/lkml/202101111033.2D03EA97@keescook/T/#u Test: built and flashed kernel Bug: 153444106 Signed-off-by: Edgar Arriaga Garcia <edgararriaga@google.com> Change-Id: I3624a8b0697d70f23587c1dcb746ba753c301f45 Signed-off-by: azrim <mirzaspc@gmail.com> |
||
|
5e6f2afbcf
|
UPSTREAM: mm/madvise: remove racy mm ownership check
Jann spotted a security hole caused by a race in the mm ownership check. If the task is sharing the mm_struct but goes through execve() before mm_access(), it could skip the process_madvise_behavior_valid check. That lets *any advice hint* reach into the remote process. This patch removes the mm ownership check. With it, a local process loses the ability to give *any* advice hint through the vector interface for some reason (e.g., performance). Since there is no concrete example upstream yet, it is better to remove the ability for now and revisit it when such new advice comes up. Fixes: ecb8ac8b1f14 ("mm/madvise: introduce process_madvise() syscall: an external memory hinting API") Reported-by: Jann Horn <jannh@google.com> Suggested-by: Jann Horn <jannh@google.com> Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit a68a0262abdaa251e12c53715f48e698a18ef402) Bug: 153444106 Test: Built and flashed kernel Signed-off-by: Edgar Arriaga Garcia <edgararriaga@google.com> Change-Id: I49b1a581d1d6b651b46e0e7024cf61bce29578ba Signed-off-by: azrim <mirzaspc@gmail.com> |
||
|
a39ffe1406
|
FROMLIST: mm/madvise: fix memory leak from process_madvise
The early return in process_madvise will produce a memory leak. Fix it. Fixes: ecb8ac8b1f14 ("mm/madvise: introduce process_madvise() syscall: an external memory hinting API") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Minchan Kim <minchan@kernel.org> Link: https://lore.kernel.org/linux-mm/20201116155132.GA3805951@google.com/ Test: Built and flashed kernel Signed-off-by: Edgar Arriaga Garcia <edgararriaga@google.com> Change-Id: Id1f7df48debe7fd55e51bf99257e1a2c6d97b285 Signed-off-by: azrim <mirzaspc@gmail.com> |
||
|
e4493a4b48
|
BACKPORT: mm/madvise: introduce process_madvise() syscall: an external memory hinting API
There is a use case where System Management Software (SMS) wants to give a memory hint like MADV_[COLD|PAGEOUT] to other processes; in the case of Android, it is the ActivityManagerService. It's similar in spirit to madvise(MADV_WONTNEED), but the information required to make the reclaim decision is not known to the app. Instead, it is known to the centralized userspace daemon (ActivityManagerService), and that daemon must be able to initiate reclaim on its own without any app involvement. To solve the issue, this patch introduces a new syscall, process_madvise(2). It uses the pidfd of an external process to give the hint. int process_madvise(int pidfd, void *addr, size_t length, int advise, unsigned long flag); Since it could affect another process's address range, only a privileged process (CAP_SYS_PTRACE) or one that otherwise has the right to ptrace the target (e.g., by having the same UID) can use it successfully. The flag argument is reserved for future use if we need to extend the API. I think supporting in process_madvise every hint that madvise supports or will support is rather risky, because we are not sure all hints make sense from an external process, and the implementation of a hint may rely on the caller being in the current context, so it could be error-prone. Thus, I just limited the hints to MADV_[COLD|PAGEOUT] in this patch. If someone wants to add other hints, we can hear the use case and review each hint. That is safer for maintenance than introducing a buggy syscall that is hard to fix later. 
[1] https://developer.android.com/topic/performance/memory" [2] process_getinfo for getting the cookie which is updated whenever vma of process address layout are changed - Daniel Colascione - https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224 [3] anonymous fd which is used for the object(i.e., address range) validation - Michal Hocko - https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/ Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Brian Geffon <bgeffon@google.com> Cc: Christian Brauner <christian@brauner.io> Cc: Daniel Colascione <dancol@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Dias <joaodias@google.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Sandeep Patil <sspatil@google.com> Cc: SeongJae Park <sj38.park@gmail.com> Cc: SeongJae Park <sjpark@amazon.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Tim Murray <timmurray@google.com> Cc: <linux-man@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit ecb8ac8b1f146915aa6b96449b66dd48984caacc) Conflicts: arch/alpha/kernel/syscalls/syscall.tbl arch/ia64/kernel/syscalls/syscall.tbl arch/m68k/kernel/syscalls/syscall.tbl arch/microblaze/kernel/syscalls/syscall.tbl arch/mips/kernel/syscalls/syscall_n32.tbl arch/mips/kernel/syscalls/syscall_n64.tbl arch/parisc/kernel/syscalls/syscall.tbl arch/powerpc/kernel/syscalls/syscall.tbl arch/s390/kernel/syscalls/syscall.tbl arch/sh/kernel/syscalls/syscall.tbl 
arch/sparc/kernel/syscalls/syscall.tbl mm/madvise.c 1. __NR_compat_syscalls in arch/arm64/include/asm/unistd.h modified to match latest version to avoid clobbering old number. 2. Dropped syscall.tbl, syscall_n32, syscall_n64 files for architectures not present in current kernel. 3. __NR_process_madvise in arch/arm64/include/asm/unistd32.h modified to match latest mm tree. 4. Added include for uio.h lib which is needed for UIO_FASTIOV and iovec Bug: 153444106 Test: Built kernel Signed-off-by: Edgar Arriaga García <edgararriaga@google.com> Change-Id: Icfff940abebcf290c3111239989ed40a407cf2a6 Signed-off-by: azrim <mirzaspc@gmail.com> |
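A caller of the external hinting API described in this commit might look roughly like the sketch below. It is an illustration under stated assumptions, not code from the patch: the wrapper name `hint_cold_remote` is hypothetical, and the sketch assumes the iovec-based form of process_madvise(2) that was merged upstream in Linux 5.10 rather than the addr/length prototype quoted in the message. The fallback syscall numbers (434 for pidfd_open, 440 for process_madvise) and the MADV_COLD value are the upstream generic ones and may not match this backport.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef MADV_COLD
#define MADV_COLD 20            /* assumption: upstream generic uapi value */
#endif
#ifndef __NR_pidfd_open
#define __NR_pidfd_open 434     /* assumption: upstream generic syscall number */
#endif
#ifndef __NR_process_madvise
#define __NR_process_madvise 440 /* assumption: upstream generic syscall number */
#endif

/*
 * Hypothetical wrapper: tell the kernel that [addr, addr + len) in the
 * address space of process `pid` is cold. `addr` is interpreted in the
 * target's address space, not the caller's.
 * Returns the number of bytes advised, or -1 with errno set
 * (e.g. ENOSYS on kernels without the syscall, EPERM without ptrace rights).
 */
static long hint_cold_remote(pid_t pid, void *addr, size_t len)
{
    assert(len > 0);

    int pidfd = (int)syscall(__NR_pidfd_open, (long)pid, 0L);
    if (pidfd < 0)
        return -1;

    struct iovec iov = { .iov_base = addr, .iov_len = len };
    long ret = syscall(__NR_process_madvise, (long)pidfd, &iov, 1L,
                       (long)MADV_COLD, 0L);

    /* Preserve the syscall's errno across close(). */
    int saved = errno;
    close(pidfd);
    errno = saved;
    return ret;
}
```

A daemon in the role of ActivityManagerService would call this with the pid of a backgrounded app and an address range taken from that app's memory map; the failure paths matter in practice because the caller usually cannot assume the running kernel carries the syscall.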
||
|
5478448d67
|
BACKPORT: mm/madvise: pass mm to do_madvise
We now have MADV_PAGEOUT and MADV_COLD as madvise hinting APIs. With them, an application can give the kernel hints about which memory ranges are preferred for reclaim. However, on some platforms (e.g., Android), the information required to make the hinting decision is not known to the app. Instead, it is known to a centralized userspace daemon (e.g., ActivityManagerService), and that daemon must be able to initiate reclaim on its own without any app involvement. To address this, this patch introduces a new syscall, process_madvise(2). Basically, it is the same as the madvise(2) syscall, with some differences. 1. It needs the pidfd of the target process to provide the hint. 2. It supports only MADV_{COLD|PAGEOUT|MERGEABLE|UNMERGEABLE} at this moment. Other madvise hints will be opened up when there are explicit requests from the community, to prevent unexpected bugs we couldn't support. 3. Only privileged processes can act on another process's address space. For more detail on the new API, please see the "mm: introduce external memory hinting API" description in this patchset. This patch (of 3): In upcoming patches, do_madvise will be called from an external process context, so we shouldn't assume "current" is always the hinted process's task_struct. Furthermore, we must not access mm_struct via task->mm, but obtain it via mm_access() once (in the following patch) and only use that pointer [1], so pass it to do_madvise() as well. Note that the vma->vm_mm pointers are safe, so we can use them further down the call stack. Let's pass current->mm as an argument of do_madvise so that existing behavior does not change, which prepares for the next patch and makes review easy. 
[vbabka@suse.cz: changelog tweak] [minchan@kernel.org: use current->mm for io_uring] Link: http://lkml.kernel.org/r/20200423145215.72666-1-minchan@kernel.org [akpm@linux-foundation.org: fix it for upstream changes] [akpm@linux-foundation.org: whoops] [rdunlap@infradead.org: add missing includes] Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jann Horn <jannh@google.com> Cc: Tim Murray <timmurray@google.com> Cc: Daniel Colascione <dancol@google.com> Cc: Sandeep Patil <sspatil@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Brian Geffon <bgeffon@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: John Dias <joaodias@google.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: SeongJae Park <sj38.park@gmail.com> Cc: Christian Brauner <christian@brauner.io> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: SeongJae Park <sjpark@amazon.de> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Florian Weimer <fw@deneb.enyo.de> Cc: <linux-man@vger.kernel.org> Link: https://lkml.kernel.org/r/20200901000633.1920247-1-minchan@kernel.org Link: http://lkml.kernel.org/r/20200622192900.22757-1-minchan@kernel.org Link: http://lkml.kernel.org/r/20200302193630.68771-2-minchan@kernel.org Link: http://lkml.kernel.org/r/20200622192900.22757-2-minchan@kernel.org Link: https://lkml.kernel.org/r/20200901000633.1920247-2-minchan@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 0726b01e70455f9900ab524117c7b520d197dc8c) Conflicts: fs/io_uring.c mm/madvise.c 1. 
fs/io_uring.c changes are not included because the file does not exist in 4.14. 2. mm/madvise.c: did not need additional includes. 3. mm/madvise.c: refactored to use mm instead of current->mm, as that is what the patch changed. 4. mm/madvise.c: kept the mmget_still_valid check that exits early from madvise if a core dump is in progress at the same time we try to madvise. 5. mm/madvise.c: did not add mm for madvise_willneed, as it was not needed and caused an unused-variable error when compiling. Bug: 153444106 Test: Built kernel Signed-off-by: Edgar Arriaga García <edgararriaga@google.com> Change-Id: I11b99220ae2a5e94cf46cb8b1f28fb109b4a25da Signed-off-by: azrim <mirzaspc@gmail.com> |
||
|
8b658cd540
|
UPSTREAM: mm: check that mm is still valid in madvise()
IORING_OP_MADVISE can end up basically doing mprotect() on the VM of another process, which means that it can race with our crazy core dump handling which accesses the VM state without holding the mmap_sem (because it incorrectly thinks that it is the final user). This is clearly a core dumping problem, but we've never fixed it the right way, and instead have the notion of "check that the mm is still ok" using mmget_still_valid() after getting the mmap_sem for writing in any situation where we're not the original VM thread. See commit 04f5866e41fb ("coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping") for more background on this whole mmget_still_valid() thing. You might want to have a barf bag handy when you do. We're discussing just fixing this properly in the only remaining core dumping routines. But even if we do that, let's make do_madvise() do the right thing, and then when we fix core dumping, we can remove all these mmget_still_valid() checks. Reported-and-tested-by: Jann Horn <jannh@google.com> Fixes: c1ca757bd6f4 ("io_uring: add IORING_OP_MADVISE") Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit bc0c4d1e176eeb614dc8734fc3ace34292771f11) Bug: 153444106 Test: Built kernel Signed-off-by: Edgar Arriaga García <edgararriaga@google.com> Change-Id: I9e300af00dd41d49be17abd545ac6572dbe4b797 Signed-off-by: azrim <mirzaspc@gmail.com> |
||
|
07be66e42e
|
UPSTREAM: mm: do not allow MADV_PAGEOUT for CoW pages
Jann has brought up a very interesting point [1]. While shared pages are normally excluded from MADV_PAGEOUT, CoW pages can be easily reclaimed that way. This can lead to all sorts of hard-to-debug problems, e.g. the performance problems outlined by Daniel [2]. There are runtime environments where a substantial amount of memory is shared among security domains via CoW memory, and an easy way to reclaim that memory, which MADV_{COLD,PAGEOUT} offers, can lead either to performance degradation for the parent process, which might be more privileged, or even to side channel attacks. The feasibility of the latter is not really clear to me TBH but there is no real reason for exposure at this stage. It seems there is no real use case that depends on reclaiming CoW memory via madvise at this stage, so it is much easier to simply disallow it, and this is what this patch does. Put simply, MADV_{PAGEOUT,COLD} can operate only on exclusively owned memory, which is a straightforward semantic. [1] http://lkml.kernel.org/r/CAG48ez0G3JkMq61gUmyQAaCq=_TwHbi1XKzWRooxZkv08PQKuw@mail.gmail.com [2] http://lkml.kernel.org/r/CAKOZueua_v8jHCpmEtTB6f3i9e2YnmX4mqdYVWhV4E=Z-n+zRQ@mail.gmail.com Fixes: 9c276cc65a58 ("mm: introduce MADV_COLD") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200312082248.GS23944@dhcp22.suse.cz Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from 12e967fd8e4e6c3d275b4c69c890adc838891300) Bug: 153444106 Test: Built kernel Signed-off-by: Edgar Arriaga García <edgararriaga@google.com> Change-Id: I18d197e4b241405e6c8051cc7a5e7cbd3a1ee5b9 Signed-off-by: azrim 
<mirzaspc@gmail.com> |
||
|
d77065c051
|
UPSTREAM: mm: validate pmd after splitting
syzbot reported the following KASAN splat: general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f] CPU: 1 PID: 6826 Comm: syz-executor142 Not tainted 5.9.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__lock_acquire+0x84/0x2ae0 kernel/locking/lockdep.c:4296 Code: ff df 8a 04 30 84 c0 0f 85 e3 16 00 00 83 3d 56 58 35 08 00 0f 84 0e 17 00 00 83 3d 25 c7 f5 07 00 74 2c 4c 89 e8 48 c1 e8 03 <80> 3c 30 00 74 12 4c 89 ef e8 3e d1 5a 00 48 be 00 00 00 00 00 fc RSP: 0018:ffffc90004b9f850 EFLAGS: 00010006 Call Trace: lock_acquire+0x140/0x6f0 kernel/locking/lockdep.c:5006 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline] _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151 spin_lock include/linux/spinlock.h:354 [inline] madvise_cold_or_pageout_pte_range+0x52f/0x25c0 mm/madvise.c:389 walk_pmd_range mm/pagewalk.c:89 [inline] walk_pud_range mm/pagewalk.c:160 [inline] walk_p4d_range mm/pagewalk.c:193 [inline] walk_pgd_range mm/pagewalk.c:229 [inline] __walk_page_range+0xe7b/0x1da0 mm/pagewalk.c:331 walk_page_range+0x2c3/0x5c0 mm/pagewalk.c:427 madvise_pageout_page_range mm/madvise.c:521 [inline] madvise_pageout mm/madvise.c:557 [inline] madvise_vma mm/madvise.c:946 [inline] do_madvise+0x12d0/0x2090 mm/madvise.c:1145 __do_sys_madvise mm/madvise.c:1171 [inline] __se_sys_madvise mm/madvise.c:1169 [inline] __x64_sys_madvise+0x76/0x80 mm/madvise.c:1169 do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The backing vma was shmem. In case of split page of file-backed THP, madvise zaps the pmd instead of remapping of sub-pages. So we need to check pmd validity after split. 
Reported-by: syzbot+ecf80462cb7d5d552bc7@syzkaller.appspotmail.com Fixes: 1a4e58cce84e ("mm: introduce MADV_PAGEOUT") Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit ce2684254bd4818ca3995c0d021fb62c4cf10a19) Bug: 153444106 Test: Built kernel Signed-off-by: Edgar Arriaga García <edgararriaga@google.com> Change-Id: I8ae135a797b82a333057a68c12df85a41c15747f Signed-off-by: azrim <mirzaspc@gmail.com> |
||
|
4b0678dbf5
|
UPSTREAM: mm: factor out common parts between MADV_COLD and MADV_PAGEOUT
There are many common parts between MADV_COLD and MADV_PAGEOUT. This patch factors them out to avoid code duplication. Link: http://lkml.kernel.org/r/20190726023435.214162-6-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Chris Zankel <chris@zankel.net> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: kbuild test robot <lkp@intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit d616d5126503967bf365db0711ee3c78b356efe9) Bug: 153444106 Test: Built kernel Signed-off-by: Edgar Arriaga García <edgararriaga@google.com> Change-Id: I125cdc8410d66d38907a9c74668306a32e7df55e Signed-off-by: azrim <mirzaspc@gmail.com> |
||
|
328f23e54e
|
UPSTREAM: mm: introduce MADV_PAGEOUT
When a process expects no accesses to a certain memory range for a long time, it can hint to the kernel that the pages can be reclaimed instantly but that the data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_PAGEOUT hint to the madvise(2) syscall. MADV_PAGEOUT can be used by a process to mark a memory range as not expected to be used for a long time so that the kernel reclaims *any LRU* pages instantly. The hint can help the kernel decide which pages to evict proactively. A note: It intentionally doesn't apply the SWAP_CLUSTER_MAX LRU page isolation limit because it's automatically bounded by PMD size. If the PMD size (e.g., 256) causes trouble, we could fix it later by limiting it to SWAP_CLUSTER_MAX [1]. - man-page material MADV_PAGEOUT (since Linux x.x) Do not expect access in the near future so pages in the specified regions could be reclaimed instantly regardless of memory pressure. Thus, access in the range after successful operation could cause major page fault but never lose the up-to-date contents unlike MADV_DONTNEED. Pages belonging to a shared mapping are only processed if a write access is allowed for the calling process. MADV_PAGEOUT cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [1] https://lore.kernel.org/lkml/20190710194719.GS29695@dhcp22.suse.cz/ [minchan@kernel.org: clear PG_active on MADV_PAGEOUT] Link: http://lkml.kernel.org/r/20190802200643.GA181880@google.com (cherry-picked from commit 1a4e58cce84ee88129d5d49c064bd2852b481357) Bug: 153444106 Test: Built kernel Signed-off-by: Edgar Arriaga García <edgararriaga@google.com> Change-Id: Id400fe31150226684ffb6f37f399c4867490656e Signed-off-by: azrim <mirzaspc@gmail.com> |
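From userspace, the behaviour promised by the man-page text above (pages may be reclaimed at once, but contents are never lost) can be exercised with a sketch like the following. This is a hedged illustration, not code from the patch: the helper name `pageout_and_verify` is hypothetical, and the fallback value 21 for `MADV_PAGEOUT` is assumed to match the upstream generic uapi headers. On a kernel without the hint, `madvise()` is expected to fail, typically with EINVAL, and the data is untouched either way.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21 /* assumption: upstream generic uapi value */
#endif

/*
 * Hypothetical helper: fill an anonymous private mapping of `npages` pages,
 * ask the kernel to page it out, then read every byte back.
 * Returns 1 if the contents are intact, 0 if any byte changed, and -1 if
 * the mapping could not be created. The raw madvise() result (0 or -1) is
 * stored in *advise_ret so the caller can tell whether the hint was accepted.
 */
static int pageout_and_verify(size_t npages, int *advise_ret)
{
    assert(npages > 0 && advise_ret != NULL);

    size_t len = npages * (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    memset(buf, 0xA5, len);
    *advise_ret = madvise(buf, len, MADV_PAGEOUT);

    /* Touching the range again may major-fault, but must see the old data. */
    int intact = 1;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != 0xA5) {
            intact = 0;
            break;
        }
    }

    munmap(buf, len);
    return intact;
}
```

Whether the pages actually leave memory depends on swap being available for anonymous memory; the sketch only checks the data-preservation guarantee, which is what distinguishes this hint from MADV_DONTNEED.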
||
|
083dbe75d0
|
BACKPORT: mm: introduce MADV_COLD
When a process expects no accesses to a certain memory range, it can give a hint to the kernel that the pages can be reclaimed when memory pressure happens but that the data should be preserved for future use. This could reduce workingset eviction so it ends up increasing performance. This patch introduces the new MADV_COLD hint to the madvise(2) syscall. MADV_COLD can be used by a process to mark a memory range as not expected to be used in the near future. The hint can help the kernel decide which pages to evict early during memory pressure. It works for all LRU pages, like MADV_[DONTNEED|FREE]. IOW, it moves active file page -> inactive file LRU active anon page -> inactive anon LRU Unlike MADV_FREE, it doesn't move active anonymous pages to the head of the inactive file LRU, because MADV_COLD has slightly different semantics. MADV_FREE means it's okay to discard the pages under memory pressure because the content of the page is *garbage*, so freeing such pages has almost zero overhead: we don't need to swap out, and a later access causes just a minor fault. Thus, it makes sense to put those freeable pages in the inactive file LRU to compete with other used-once pages. It makes sense from an implementation point of view, too, because the memory is no longer swap-backed until it is re-dirtied. It could even give a bonus by letting them be reclaimed on a swapless system. However, MADV_COLD doesn't mean garbage, so reclaiming the pages requires swap-out/in in the end, which is a bigger cost. Since we have designed VM LRU aging based on a cost model, anonymous cold pages are better placed on the inactive anon LRU list, not the file LRU. Furthermore, that helps to avoid unnecessary scanning if the system doesn't have a swap device. Let's start with the simpler way without adding complexity at this moment. However, keep in mind the caveat that workloads with a lot of page cache are likely to ignore MADV_COLD on anonymous memory because we rarely age anonymous LRU lists. 
* man-page material MADV_COLD (since Linux x.x) Pages in the specified regions will be treated as less-recently-accessed compared to pages in the system with similar access frequencies. In contrast to MADV_FREE, the contents of the region are preserved regardless of subsequent writes to pages. MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 9c276cc65a58faf98be8e56962745ec99ab87636) Conflicts: include/asm-generic/tlb.h 1. Added tlb_change_page_size function from commit ed6a79352cad00e9a49d6e438be40e45107207bf which was required by this patch. 2. 
the lru_deactivate_pvecs and other swap.c changes are skipped as they already existed in the repo, they were backported in change-id I06fed20103671e4ca6fb8663d5029736442162a5 Bug: 153444106 Test: Built kernel Signed-off-by: Edgar Arriaga García <edgararriaga@google.com> Change-Id: I8f0f9d54e2f3d0ffe75c54f4db67e73f60083482 Signed-off-by: azrim <mirzaspc@gmail.com> |
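As a usage illustration for the MADV_COLD description above, an application-side helper might look like the sketch below. It is only a sketch under assumptions: `mark_range_cold` is a hypothetical name, the fallback value 20 for `MADV_COLD` is assumed from the upstream generic uapi headers, and the helper rounds the start address down to a page boundary because madvise(2) requires a page-aligned start.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_COLD
#define MADV_COLD 20 /* assumption: upstream generic uapi value */
#endif

/*
 * Hypothetical helper: mark the pages covering [addr, addr + len) as cold so
 * they become early reclaim candidates under memory pressure. Contents are
 * preserved. Returns 0 on success, or the errno reported by madvise()
 * (typically EINVAL on kernels that do not know MADV_COLD).
 */
static int mark_range_cold(void *addr, size_t len)
{
    assert(addr != NULL && len > 0);

    uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);
    /* Round the start down to the containing page boundary. */
    uintptr_t start = (uintptr_t)addr & ~(page - 1);
    /* Extend the length by the amount the start moved back. */
    size_t span = (size_t)((uintptr_t)addr - start) + len;

    if (madvise((void *)start, span, MADV_COLD) != 0)
        return errno;
    return 0;
}
```

A cache that keeps rarely used entries in a large anonymous mapping could call `mark_range_cold()` on those entries instead of freeing them, accepting a later swap-in cost in exchange for less pressure on the hot working set.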
||
|
a884e01a78
|
UPSTREAM: mm: madvise: fix vma use-after-free
commit 7867fd7cc44e63c6673cd0f8fea155456d34d0de upstream. The syzbot reported the below use-after-free: BUG: KASAN: use-after-free in madvise_willneed mm/madvise.c:293 [inline] BUG: KASAN: use-after-free in madvise_vma mm/madvise.c:942 [inline] BUG: KASAN: use-after-free in do_madvise.part.0+0x1c8b/0x1cf0 mm/madvise.c:1145 Read of size 8 at addr ffff8880a6163eb0 by task syz-executor.0/9996 CPU: 0 PID: 9996 Comm: syz-executor.0 Not tainted 5.9.0-rc1-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x18f/0x20d lib/dump_stack.c:118 print_address_description.constprop.0.cold+0xae/0x497 mm/kasan/report.c:383 __kasan_report mm/kasan/report.c:513 [inline] kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530 madvise_willneed mm/madvise.c:293 [inline] madvise_vma mm/madvise.c:942 [inline] do_madvise.part.0+0x1c8b/0x1cf0 mm/madvise.c:1145 do_madvise mm/madvise.c:1169 [inline] __do_sys_madvise mm/madvise.c:1171 [inline] __se_sys_madvise mm/madvise.c:1169 [inline] __x64_sys_madvise+0xd9/0x110 mm/madvise.c:1169 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Allocated by task 9992: kmem_cache_alloc+0x138/0x3a0 mm/slab.c:3482 vm_area_alloc+0x1c/0x110 kernel/fork.c:347 mmap_region+0x8e5/0x1780 mm/mmap.c:1743 do_mmap+0xcf9/0x11d0 mm/mmap.c:1545 vm_mmap_pgoff+0x195/0x200 mm/util.c:506 ksys_mmap_pgoff+0x43a/0x560 mm/mmap.c:1596 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Freed by task 9992: kmem_cache_free.part.0+0x67/0x1f0 mm/slab.c:3693 remove_vma+0x132/0x170 mm/mmap.c:184 remove_vma_list mm/mmap.c:2613 [inline] __do_munmap+0x743/0x1170 mm/mmap.c:2869 do_munmap mm/mmap.c:2877 [inline] mmap_region+0x257/0x1780 mm/mmap.c:1716 do_mmap+0xcf9/0x11d0 mm/mmap.c:1545 vm_mmap_pgoff+0x195/0x200 mm/util.c:506 ksys_mmap_pgoff+0x43a/0x560 mm/mmap.c:1596 do_syscall_64+0x2d/0x70 
arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 It is because vma is accessed after releasing mmap_lock, but someone else acquired the mmap_lock and the vma is gone. Releasing mmap_lock after accessing vma should fix the problem. Fixes: 692fe62433d4c ("mm: Handle MADV_WILLNEED through vfs_fadvise()") Reported-by: syzbot+b90df26038d1d5d85c97@syzkaller.appspotmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: <stable@vger.kernel.org> [5.4+] Link: https://lkml.kernel.org/r/20200816141204.162624-1-shy828301@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com> Change-Id: I65ea23f83f7128d20ddc178965ceb5948272c6c5 Signed-off-by: azrim <mirzaspc@gmail.com> |
||
|
518347c3aa
|
UPSTREAM: mm: Handle MADV_WILLNEED through vfs_fadvise()
Currently handling of MADV_WILLNEED hint calls directly into readahead code. Handle it by calling vfs_fadvise() instead so that filesystem can use its ->fadvise() callback to acquire necessary locks or otherwise prepare for the request. Suggested-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Boaz Harrosh <boazh@netapp.com> CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: UtsavBalar1231 <utsavbalar1231@gmail.com> Change-Id: I4c0f564139668b15595ce3d0a165c5cb756ef169 Signed-off-by: azrim <mirzaspc@gmail.com> |
||
|
60b5f5cc09
|
FROMLIST: mm/damon/reclaim: Fix the timer always stays active
The timer stays active even if the reclaim mechanism is never enabled. This unnecessary overhead can be avoided completely by using module_param_cb() for the enabled flag. Signed-off-by: Hailong Tu <tuhailong@gmail.com> Link: https://lore.kernel.org/all/20220421125910.1052459-1-tuhailong@gmail.com/ Bug: 228223814 Signed-off-by: Hailong Tu <tuhailong@oppo.com> Change-Id: I77591e41ce424ce16a4a5c70e7a86cdae996a354 Signed-off-by: azrim <mirzaspc@gmail.com> |