msm-4.14

mirror of https://github.com/rd-stuffs/msm-4.14.git synced 2025-02-20 11:45:48 +08:00

Author	SHA1	Message	Date
Richard Raya	b6eaf5c762	Revert "mm: move buddy list manipulations into helpers" This reverts commit be1968b3980cb973d3034605735ab12e1fa4672a. Change-Id: I203cdaca4f6fa23584a7b4c1ea15c73ff2137227 Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>	2024-09-30 01:46:45 -03:00
Richard Raya	3143685e95	Merge branch 'linux-4.14.y' of https://github.com/openela/kernel-lts * 'linux-4.14.y' of https://github.com/openela/kernel-lts: (278 commits) LTS: Update to 4.14.348 docs: kernel_include.py: Cope with docutils 0.21 serial: kgdboc: Fix NMI-safety problems from keyboard reset code btrfs: add missing mutex_unlock in btrfs_relocate_sys_chunks() dm: limit the number of targets and parameter size area Revert "selftests: mm: fix map_hugetlb failure on 64K page size systems" LTS: Update to 4.14.347 rds: Fix build regression. RDS: IB: Use DEFINE_PER_CPU_SHARED_ALIGNED for rds_ib_stats af_unix: Suppress false-positive lockdep splat for spin_lock() in __unix_gc(). net: fix out-of-bounds access in ops_init drm/vmwgfx: Fix invalid reads in fence signaled events dyndbg: fix old BUG_ON in >control parser tipc: fix UAF in error path usb: gadget: f_fs: Fix a race condition when processing setup packets. usb: gadget: composite: fix OS descriptors w_value logic firewire: nosy: ensure user_length is taken into account when fetching packet contents af_unix: Fix garbage collector racing against connect() af_unix: Do not use atomic ops for unix_sk(sk)->inflight. ipv6: fib6_rules: avoid possible NULL dereference in fib6_rule_action() ... Change-Id: If329d39dd4e95e14045bb7c58494c197d1352d60 Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>	2024-06-04 16:33:29 -03:00
Vlastimil Babka	ce55bbe342	mm, vmscan: prevent infinite loop for costly GFP_NOIO \| __GFP_RETRY_MAYFAIL allocations commit 803de9000f334b771afacb6ff3e78622916668b0 upstream. Sven reports an infinite loop in __alloc_pages_slowpath() for costly order __GFP_RETRY_MAYFAIL allocations that are also GFP_NOIO. Such combination can happen in a suspend/resume context where a GFP_KERNEL allocation can have __GFP_IO masked out via gfp_allowed_mask. Quoting Sven: 1. try to do a "costly" allocation (order > PAGE_ALLOC_COSTLY_ORDER) with __GFP_RETRY_MAYFAIL set. 2. page alloc's __alloc_pages_slowpath tries to get a page from the freelist. This fails because there is nothing free of that costly order. 3. page alloc tries to reclaim by calling __alloc_pages_direct_reclaim, which bails out because a zone is ready to be compacted; it pretends to have made a single page of progress. 4. page alloc tries to compact, but this always bails out early because __GFP_IO is not set (it's not passed by the snd allocator, and even if it were, we are suspending so the __GFP_IO flag would be cleared anyway). 5. page alloc believes reclaim progress was made (because of the pretense in item 3) and so it checks whether it should retry compaction. The compaction retry logic thinks it should try again, because: a) reclaim is needed because of the early bail-out in item 4 b) a zonelist is suitable for compaction 6. goto 2. indefinite stall. (end quote) The immediate root cause is confusing the COMPACT_SKIPPED returned from __alloc_pages_direct_compact() (step 4) due to lack of __GFP_IO to be indicating a lack of order-0 pages, and in step 5 evaluating that in should_compact_retry() as a reason to retry, before incrementing and limiting the number of retries. There are however other places that wrongly assume that compaction can happen while we lack __GFP_IO. To fix this, introduce gfp_compaction_allowed() to abstract the __GFP_IO evaluation and switch the open-coded test in try_to_compact_pages() to use it. Also use the new helper in: - compaction_ready(), which will make reclaim not bail out in step 3, so there's at least one attempt to actually reclaim, even if chances are small for a costly order - in_reclaim_compaction() which will make should_continue_reclaim() return false and we don't over-reclaim unnecessarily - in __alloc_pages_slowpath() to set a local variable can_compact, which is then used to avoid retrying reclaim/compaction for costly allocations (step 5) if we can't compact and also to skip the early compaction attempt that we do in some cases Link: https://lkml.kernel.org/r/20240221114357.13655-2-vbabka@suse.cz Fixes: 3250845d0526 ("Revert "mm, oom: prevent premature OOM killer invocation for high order request"") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Sven van Ashbrook <svenva@chromium.org> Closes: https://lore.kernel.org/all/CAG-rBihs_xMKb3wrMO1%2B-%2Bp4fowP9oy1pa_OTkfxBzPUVOZF%2Bg@mail.gmail.com/ Tested-by: Karthikeyan Ramasubramanian <kramasub@chromium.org> Cc: Brian Geffon <bgeffon@google.com> Cc: Curtis Malainey <cujomalainey@chromium.org> Cc: Jaroslav Kysela <perex@perex.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Takashi Iwai <tiwai@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> (cherry picked from commit c82a659cc8bb7a7f8a8348fc7f203c412ae3636f) Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>	2024-05-30 09:00:42 +00:00
Dan Williams	be1968b398	mm: move buddy list manipulations into helpers In preparation for runtime randomization of the zone lists, take all (well, most of) the list_*() functions in the buddy allocator and put them in helper functions. Provide a common control point for injecting additional behavior when freeing pages. [dan.j.williams@intel.com: fix buddy list helpers] Link: http://lkml.kernel.org/r/155033679702.1773410.13041474192173212653.stgit@dwillia2-desk3.amr.corp.intel.com [vbabka@suse.cz: remove del_page_from_free_area() migratetype parameter] Link: http://lkml.kernel.org/r/4672701b-6775-6efd-0797-b6242591419e@suse.cz Link: http://lkml.kernel.org/r/154899812264.3165233.5219320056406926223.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Kees Cook <keescook@chromium.org> Cc: Keith Busch <keith.busch@intel.com> Cc: Robert Elliott <elliott@hpe.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: azrim <mirzaspc@gmail.com>	2022-08-25 14:41:21 +00:00
Yu Zhao	bb4a01c36e	BACKPORT: mm/swap.c: don't pass "enum lru_list" to del_page_from_lru_list() The parameter is redundant in the sense that it can be potentially extracted from the "struct page" parameter by page_lru(). We need to make sure that existing PageActive() or PageUnevictable() remains until the function returns. A few places don't conform, and simple reordering fixes them. This patch may have left page_off_lru() seemingly odd, and we'll take care of it in the next patch. Link: https://lore.kernel.org/linux-mm/20201207220949.830352-6-yuzhao@google.com/ Link: https://lkml.kernel.org/r/20210122220600.906146-6-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 46ae6b2cc2a47904a368d238425531ea91f3a2a5) Bug: 228114874 Change-Id: I1e14dcbf4111b39cf155ed3512423448865eb324 Signed-off-by: azrim <mirzaspc@gmail.com>	2022-07-01 09:15:23 +00:00
azrim	acf071c96b	Revert "BACKPORT: mm/swap.c: don't pass "enum lru_list" to del_page_from_lru_list()" This reverts commit ab4b57fe4682a6246055e00f8dca5b90ea37dce0.	2022-07-01 09:04:09 +00:00
Yu Zhao	ab4b57fe46	BACKPORT: mm/swap.c: don't pass "enum lru_list" to del_page_from_lru_list() The parameter is redundant in the sense that it can be potentially extracted from the "struct page" parameter by page_lru(). We need to make sure that existing PageActive() or PageUnevictable() remains until the function returns. A few places don't conform, and simple reordering fixes them. This patch may have left page_off_lru() seemingly odd, and we'll take care of it in the next patch. Link: https://lore.kernel.org/linux-mm/20201207220949.830352-6-yuzhao@google.com/ Link: https://lkml.kernel.org/r/20210122220600.906146-6-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 46ae6b2cc2a47904a368d238425531ea91f3a2a5) BUG=b:123039911 TEST=Built Change-Id: Iaf7a9c8c71da7d41c40f566ef9be8ac33c4e012d Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/2913975 Reviewed-by: Yu Zhao <yuzhao@chromium.org> Commit-Queue: Yu Zhao <yuzhao@chromium.org> Tested-by: Yu Zhao <yuzhao@chromium.org> Signed-off-by: azrim <mirzaspc@gmail.com>	2022-06-10 15:54:17 +07:00
Blagovest Kolenichev	fbbfccc5a5	Merge android-4.14-q.147 (5c8d069) into msm-4.14 * refs/heads/tmp-5c8d069: Revert "net: qrtr: Stop rx_worker before freeing node" Linux 4.14.147 Btrfs: fix race setting up and completing qgroup rescan workers btrfs: qgroup: Drop quota_root and fs_info parameters from update_qgroup_status_item mm/compaction.c: clear total_{migrate,free}_scanned before scanning a new zone md/raid0: avoid RAID0 data corruption due to layout confusion. CIFS: Fix oplock handling for SMB 2.1+ protocols CIFS: fix max ea value size i2c: riic: Clear NACK in tend isr hwrng: core - don't wait on add_early_randomness() quota: fix wrong condition in is_quota_modification() ext4: fix punch hole for inline_data file systems ext4: fix warning inside ext4_convert_unwritten_extents_endio /dev/mem: Bail out upon SIGKILL. cfg80211: Purge frame registrations on iftype change md: only call set_in_sync() when it is expected to succeed. md: don't report active array_state until after revalidate_disk() completes. md/raid6: Set R5_ReadError when there is read failure on parity disk btrfs: qgroup: Fix the wrong target io_tree when freeing reserved data space btrfs: Relinquish CPUs in btrfs_compare_trees Btrfs: fix use-after-free when using the tree modification log ovl: filter of trusted xattr results in audit memcg, kmem: do not fail __GFP_NOFAIL charges memcg, oom: don't require __GFP_FS when invoking memcg OOM killer gfs2: clear buf_in_tr when ending a transaction in sweep_bh_for_rgrps regulator: Defer init completion for a while after late_initcall alarmtimer: Use EOPNOTSUPP instead of ENOTSUPP arm64: dts: rockchip: limit clock rate of MMC controllers for RK3328 ARM: zynq: Use memcpy_toio instead of memcpy on smp bring-up ARM: samsung: Fix system restart on S3C6410 ASoC: Intel: Fix use of potentially uninitialized variable ASoC: Intel: Skylake: Use correct function to access iomem space ASoC: Intel: NHLT: Fix debug print format binfmt_elf: Do not move brk for INTERP-less ET_EXEC media: sn9c20x: Add MSI MS-1039 laptop to flip_dmi_table KVM: x86: Manually calculate reserved bits when loading PDPTRS KVM: x86: set ctxt->have_exception in x86_decode_insn() KVM: x86: always stop emulation on page fault x86/retpolines: Fix up backport of a9d57ef15cbe parisc: Disable HP HSC-PCI Cards to prevent kernel crash fuse: fix missing unlock_page in fuse_writepage() ALSA: hda/realtek - Fixup mute led on HP Spectre x360 randstruct: Check member structs in is_pure_ops_struct() IB/hfi1: Define variables as unsigned long to fix KASAN warning printk: Do not lose last line in kmsg buffer dump scsi: scsi_dh_rdac: zero cdb in send_mode_select() ALSA: firewire-tascam: check intermediate state of clock status and retry ALSA: firewire-tascam: handle error code when getting current source of clock PM / devfreq: passive: fix compiler warning media: omap3isp: Set device on omap3isp subdevs btrfs: extent-tree: Make sure we only allocate extents from block groups with the same type ALSA: hda/realtek - Blacklist PC beep for Lenovo ThinkCentre M73/93 media: ttusb-dec: Fix info-leak in ttusb_dec_send_command() drm/amd/powerplay/smu7: enforce minimal VBITimeout (v2) ALSA: hda - Drop unsol event handler for Intel HDMI codecs e1000e: add workaround for possible stalled packet libertas: Add missing sentinel at end of if_usb.c fw_table raid5: don't increment read_errors on EILSEQ return mmc: sdhci: Fix incorrect switch to HS mode mmc: core: Clarify sdio_irq_pending flag for MMC_CAP2_SDIO_IRQ_NOTHREAD raid5: don't set STRIPE_HANDLE to stripe which is in batch list ASoC: dmaengine: Make the pcm->name equal to pcm->id if the name is not set s390/crypto: xts-aes-s390 fix extra run-time crypto self tests finding kprobes: Prohibit probing on BUG() and WARN() address dmaengine: ti: edma: Do not reset reserved paRAM slots md/raid1: fail run raid1 array when active disk less than one hwmon: (acpi_power_meter) Change log level for 'unsafe software power cap' ACPI / PCI: fix acpi_pci_irq_enable() memory leak ACPI: custom_method: fix memory leaks ARM: dts: exynos: Mark LDO10 as always-on on Peach Pit/Pi Chromebooks libtraceevent: Change users plugin directory iommu/iova: Avoid false sharing on fq_timer_on iommu/amd: Silence warnings under memory pressure nvmet: fix data units read and written counters in SMART log arm64: kpti: ensure patched kernel text is fetched from PoU ACPI / CPPC: do not require the _PSD method ASoC: es8316: fix headphone mixer volume table media: ov9650: add a sanity check perf trace beauty ioctl: Fix off-by-one error in cmd->string table media: saa7134: fix terminology around saa7134_i2c_eeprom_md7134_gate() media: cpia2_usb: fix memory leaks media: saa7146: add cleanup in hexium_attach() media: cec-notifier: clear cec_adap in cec_notifier_unregister PM / devfreq: exynos-bus: Correct clock enable sequence PM / devfreq: passive: Use non-devm notifiers EDAC/amd64: Decode syndrome before translating address EDAC/amd64: Recognize DRAM device type ECC capability libperf: Fix alignment trap with xyarray contents in 'perf stat' media: dvb-core: fix a memory leak bug nbd: add missing config put media: hdpvr: add terminating 0 at end of string media: radio/si470x: kill urb on error ARM: dts: imx7d: cl-som-imx7: make ethernet work again net: lpc-enet: fix printk format strings media: imx: mipi csi-2: Don't fail if initial state times-out media: omap3isp: Don't set streaming state on random subdevs media: i2c: ov5645: Fix power sequence perf record: Support aarch64 random socket_id assignment dmaengine: iop-adma: use correct printk format strings media: rc: imon: Allow iMON RC protocol for ffdc 7e device media: fdp1: Reduce FCP not found message level to debug media: mtk-mdp: fix reference count on old device tree perf test vfs_getname: Disable ~/.perfconfig to get default output media: gspca: zero usb_buf on error sched/fair: Use rq_lock/unlock in online_fair_sched_group efi: cper: print AER info of PCIe fatal error EDAC, pnd2: Fix ioremap() size in dnv_rd_reg() ACPI / processor: don't print errors for processorIDs == 0xff md: don't set In_sync if array is frozen md: don't call spare_active in md_reap_sync_thread if all member devices can't work md/raid1: end bio when the device faulty ASoC: rsnd: don't call clk_get_rate() under atomic context EDAC/altera: Use the proper type for the IRQ status bits ia64:unwind: fix double free for mod->arch.init_unw_table ALSA: usb-audio: Skip bSynchAddress endpoint check if it is invalid base: soc: Export soc_device_register/unregister APIs media: iguanair: add sanity checks EDAC/mc: Fix grain_bits calculation ALSA: i2c: ak4xxx-adda: Fix a possible null pointer dereference in build_adc_controls() ALSA: hda - Show the fatal CORB/RIRB error more clearly x86/apic: Soft disable APIC before initializing it x86/reboot: Always use NMI fallback when shutdown via reboot vector IPI fails sched/core: Fix CPU controller for !RT_GROUP_SCHED sched/fair: Fix imbalance due to CPU affinity media: i2c: ov5640: Check for devm_gpiod_get_optional() error media: hdpvr: Add device num check and handling media: exynos4-is: fix leaked of_node references media: mtk-cir: lower de-glitch counter for rc-mm protocol media: dib0700: fix link error for dibx000_i2c_set_speed leds: leds-lp5562 allow firmware files up to the maximum length dmaengine: bcm2835: Print error in case setting DMA mask fails ASoC: sgtl5000: Fix charge pump source assignment regulator: lm363x: Fix off-by-one n_voltages for lm3632 ldo_vpos/ldo_vneg ALSA: hda: Flush interrupts on disabling nfc: enforce CAP_NET_RAW for raw sockets ieee802154: enforce CAP_NET_RAW for raw sockets ax25: enforce CAP_NET_RAW for raw sockets appletalk: enforce CAP_NET_RAW for raw sockets mISDN: enforce CAP_NET_RAW for raw sockets net/mlx5: Add device ID of upcoming BlueField-2 usbnet: sanity checking of packet sizes and device mtu usbnet: ignore endpoints with invalid wMaxPacketSize skge: fix checksum byte order sch_netem: fix a divide by zero in tabledist() ppp: Fix memory leak in ppp_write openvswitch: change type of UPCALL_PID attribute to NLA_UNSPEC net_sched: add max len check for TCA_KIND net/sched: act_sample: don't push mac header on ip6gre ingress net: qrtr: Stop rx_worker before freeing node net/phy: fix DP83865 10 Mbps HDX loopback disable function macsec: drop skb sk before calling gro_cells_receive cdc_ncm: fix divide-by-zero caused by invalid wMaxPacketSize arcnet: provide a buffer big enough to actually receive packets f2fs: use generic EFSBADCRC/EFSCORRUPTED Bluetooth: btrtl: Additional Realtek 8822CE Bluetooth devices xfs: don't crash on null attr fork xfs_bmapi_read ACPI: video: Add new hw_changes_brightness quirk, set it on PB Easynote MZ35 net: don't warn in inet diag when IPV6 is disabled drm: Flush output polling on shutdown f2fs: fix to do sanity check on segment bitmap of LFS curseg dm zoned: fix invalid memory access Revert "f2fs: avoid out-of-range memory access" blk-mq: move cancel of requeue_work to the front of blk_exit_queue PCI: hv: Avoid use of hv_pci_dev->pci_slot after freeing it f2fs: check all the data segments against all node ones irqchip/gic-v3-its: Fix LPI release for Multi-MSI devices locking/lockdep: Add debug_locks check in __lock_downgrade() power: supply: sysfs: ratelimit property read error message pinctrl: sprd: Use define directive for sprd_pinconf_params values objtool: Clobber user CFLAGS variable ALSA: hda - Apply AMD controller workaround for Raven platform ALSA: hda - Add laptop imic fixup for ASUS M9V laptop arm64: kpti: Whitelist Cortex-A CPUs that don't implement the CSV3 field ASoC: fsl: Fix of-node refcount unbalance in fsl_ssi_probe_from_dt() media: tvp5150: fix switch exit in set control handler iwlwifi: mvm: send BCAST management frames to the right station crypto: talitos - fix missing break in switch statement mtd: cfi_cmdset_0002: Use chip_good() to retry in do_write_oneword() HID: hidraw: Fix invalid read in hidraw_ioctl HID: logitech: Fix general protection fault caused by Logitech driver HID: sony: Fix memory corruption issue on cleanup. HID: prodikeys: Fix general protection fault during probe IB/core: Add an unbound WQ type to the new CQ API objtool: Query pkg-config for libelf location powerpc/xive: Fix bogus error code returned by OPAL Revert "Bluetooth: validate BLE connection interval updates" Conflicts: drivers/mmc/core/sdio_irq.c fs/f2fs/data.c fs/f2fs/f2fs.h fs/f2fs/inode.c Change-Id: I757f54737e4d58319f2866f687a39123f0889e1e Signed-off-by: Blagovest Kolenichev <bkolenichev@codeaurora.org>	2019-10-11 05:19:38 -07:00
Greg Kroah-Hartman	5c8d069c5d	This is the 4.14.147 stable release -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl2YdO8ACgkQONu9yGCS aT50DBAAuLQNqoTz2C6HkEHqU6l9v491n8Aw6ILuUoILcoWGIOZEtdmKtRXXBWjR v0s1T4FbeOZkKHTJGXuCd8Di6BYBDBSSBRSvuvWFisCyo/nwM3i1bNVz+i0IBAbS Ug3mQOkCHfZgvht17XCDjcHgE2pdHThof5T9/3IIewfSyx4I0WZKlr4Xq1HF535g 30w1Ox+rWLQ4735LlGfMTd+oqYIaP91xyuFuGcsp2Pgy7nLFTtaa69fHD93qVH9o 2rdIowlYFU/uIkGx9qW/sMDq4Gp34xczdp/ABNiRylEL1Rf1cOkcn8KZX14ePodk 5x4hXHDkCWqquu6HnuTB6bzR7gwVsqxBEucT4v7wnkHTIFucEOeBc7E6T2llGTRe dfESrc28lXN3E9WGu7gAqi7Hvr2oDGVffthySwR6Yq4WoVSppHTc/SekZ4p2qhAl 8jp4V86U5Fwr6ERCwZ0LcQ8TUK1j9KptpJ1P4Lb/w4wT2csq8DasunDm8/7lYFfp ISa9OE4fF8bhSI45bP+o4WYac6x7F4A8RpdTGJ1qRp0crKDL7oKP5YtFzJTGyt2a FDnONywuYu2Iayt76fqYU8Lh7yDpxzLSY/66VXRAPcb4Xtc55BKjlRaoEqXOPXqB 8sid+r28LHYiOlHDfb+J/IKI8YyGAiB5ac7Pakw7Q8d07fsxiuI= =OgXW -----END PGP SIGNATURE----- Merge 4.14.147 into android-4.14-q Changes in 4.14.147 Revert "Bluetooth: validate BLE connection interval updates" powerpc/xive: Fix bogus error code returned by OPAL objtool: Query pkg-config for libelf location IB/core: Add an unbound WQ type to the new CQ API HID: prodikeys: Fix general protection fault during probe HID: sony: Fix memory corruption issue on cleanup. HID: logitech: Fix general protection fault caused by Logitech driver HID: hidraw: Fix invalid read in hidraw_ioctl mtd: cfi_cmdset_0002: Use chip_good() to retry in do_write_oneword() crypto: talitos - fix missing break in switch statement iwlwifi: mvm: send BCAST management frames to the right station media: tvp5150: fix switch exit in set control handler ASoC: fsl: Fix of-node refcount unbalance in fsl_ssi_probe_from_dt() arm64: kpti: Whitelist Cortex-A CPUs that don't implement the CSV3 field ALSA: hda - Add laptop imic fixup for ASUS M9V laptop ALSA: hda - Apply AMD controller workaround for Raven platform objtool: Clobber user CFLAGS variable pinctrl: sprd: Use define directive for sprd_pinconf_params values power: supply: sysfs: ratelimit property read error message locking/lockdep: Add debug_locks check in __lock_downgrade() irqchip/gic-v3-its: Fix LPI release for Multi-MSI devices f2fs: check all the data segments against all node ones PCI: hv: Avoid use of hv_pci_dev->pci_slot after freeing it blk-mq: move cancel of requeue_work to the front of blk_exit_queue Revert "f2fs: avoid out-of-range memory access" dm zoned: fix invalid memory access f2fs: fix to do sanity check on segment bitmap of LFS curseg drm: Flush output polling on shutdown net: don't warn in inet diag when IPV6 is disabled ACPI: video: Add new hw_changes_brightness quirk, set it on PB Easynote MZ35 xfs: don't crash on null attr fork xfs_bmapi_read Bluetooth: btrtl: Additional Realtek 8822CE Bluetooth devices f2fs: use generic EFSBADCRC/EFSCORRUPTED arcnet: provide a buffer big enough to actually receive packets cdc_ncm: fix divide-by-zero caused by invalid wMaxPacketSize macsec: drop skb sk before calling gro_cells_receive net/phy: fix DP83865 10 Mbps HDX loopback disable function net: qrtr: Stop rx_worker before freeing node net/sched: act_sample: don't push mac header on ip6gre ingress net_sched: add max len check for TCA_KIND openvswitch: change type of UPCALL_PID attribute to NLA_UNSPEC ppp: Fix memory leak in ppp_write sch_netem: fix a divide by zero in tabledist() skge: fix checksum byte order usbnet: ignore endpoints with invalid wMaxPacketSize usbnet: sanity checking of packet sizes and device mtu net/mlx5: Add device ID of upcoming BlueField-2 mISDN: enforce CAP_NET_RAW for raw sockets appletalk: enforce CAP_NET_RAW for raw sockets ax25: enforce CAP_NET_RAW for raw sockets ieee802154: enforce CAP_NET_RAW for raw sockets nfc: enforce CAP_NET_RAW for raw sockets ALSA: hda: Flush interrupts on disabling regulator: lm363x: Fix off-by-one n_voltages for lm3632 ldo_vpos/ldo_vneg ASoC: sgtl5000: Fix charge pump source assignment dmaengine: bcm2835: Print error in case setting DMA mask fails leds: leds-lp5562 allow firmware files up to the maximum length media: dib0700: fix link error for dibx000_i2c_set_speed media: mtk-cir: lower de-glitch counter for rc-mm protocol media: exynos4-is: fix leaked of_node references media: hdpvr: Add device num check and handling media: i2c: ov5640: Check for devm_gpiod_get_optional() error sched/fair: Fix imbalance due to CPU affinity sched/core: Fix CPU controller for !RT_GROUP_SCHED x86/reboot: Always use NMI fallback when shutdown via reboot vector IPI fails x86/apic: Soft disable APIC before initializing it ALSA: hda - Show the fatal CORB/RIRB error more clearly ALSA: i2c: ak4xxx-adda: Fix a possible null pointer dereference in build_adc_controls() EDAC/mc: Fix grain_bits calculation media: iguanair: add sanity checks base: soc: Export soc_device_register/unregister APIs ALSA: usb-audio: Skip bSynchAddress endpoint check if it is invalid ia64:unwind: fix double free for mod->arch.init_unw_table EDAC/altera: Use the proper type for the IRQ status bits ASoC: rsnd: don't call clk_get_rate() under atomic context md/raid1: end bio when the device faulty md: don't call spare_active in md_reap_sync_thread if all member devices can't work md: don't set In_sync if array is frozen ACPI / processor: don't print errors for processorIDs == 0xff EDAC, pnd2: Fix ioremap() size in dnv_rd_reg() efi: cper: print AER info of PCIe fatal error sched/fair: Use rq_lock/unlock in online_fair_sched_group media: gspca: zero usb_buf on error perf test vfs_getname: Disable ~/.perfconfig to get default output media: mtk-mdp: fix reference count on old device tree media: fdp1: Reduce FCP not found message level to debug media: rc: imon: Allow iMON RC protocol for ffdc 7e device dmaengine: iop-adma: use correct printk format strings perf record: Support aarch64 random socket_id assignment media: i2c: ov5645: Fix power sequence media: omap3isp: Don't set streaming state on random subdevs media: imx: mipi csi-2: Don't fail if initial state times-out net: lpc-enet: fix printk format strings ARM: dts: imx7d: cl-som-imx7: make ethernet work again media: radio/si470x: kill urb on error media: hdpvr: add terminating 0 at end of string nbd: add missing config put media: dvb-core: fix a memory leak bug libperf: Fix alignment trap with xyarray contents in 'perf stat' EDAC/amd64: Recognize DRAM device type ECC capability EDAC/amd64: Decode syndrome before translating address PM / devfreq: passive: Use non-devm notifiers PM / devfreq: exynos-bus: Correct clock enable sequence media: cec-notifier: clear cec_adap in cec_notifier_unregister media: saa7146: add cleanup in hexium_attach() media: cpia2_usb: fix memory leaks media: saa7134: fix terminology around saa7134_i2c_eeprom_md7134_gate() perf trace beauty ioctl: Fix off-by-one error in cmd->string table media: ov9650: add a sanity check ASoC: es8316: fix headphone mixer volume table ACPI / CPPC: do not require the _PSD method arm64: kpti: ensure patched kernel text is fetched from PoU nvmet: fix data units read and written counters in SMART log iommu/amd: Silence warnings under memory pressure iommu/iova: Avoid false sharing on fq_timer_on libtraceevent: Change users plugin directory ARM: dts: exynos: Mark LDO10 as always-on on Peach Pit/Pi Chromebooks ACPI: custom_method: fix memory leaks ACPI / PCI: fix acpi_pci_irq_enable() memory leak hwmon: (acpi_power_meter) Change log level for 'unsafe software power cap' md/raid1: fail run raid1 array when active disk less than one dmaengine: ti: edma: Do not reset reserved paRAM slots kprobes: Prohibit probing on BUG() and WARN() address s390/crypto: xts-aes-s390 fix extra run-time crypto self tests finding ASoC: dmaengine: Make the pcm->name equal to pcm->id if the name is not set raid5: don't set STRIPE_HANDLE to stripe which is in batch list mmc: core: Clarify sdio_irq_pending flag for MMC_CAP2_SDIO_IRQ_NOTHREAD mmc: sdhci: Fix incorrect switch to HS mode raid5: don't increment read_errors on EILSEQ return libertas: Add missing sentinel at end of if_usb.c fw_table e1000e: add workaround for possible stalled packet ALSA: hda - Drop unsol event handler for Intel HDMI codecs drm/amd/powerplay/smu7: enforce minimal VBITimeout (v2) media: ttusb-dec: Fix info-leak in ttusb_dec_send_command() ALSA: hda/realtek - Blacklist PC beep for Lenovo ThinkCentre M73/93 btrfs: extent-tree: Make sure we only allocate extents from block groups with the same type media: omap3isp: Set device on omap3isp subdevs PM / devfreq: passive: fix compiler warning ALSA: firewire-tascam: handle error code when getting current source of clock ALSA: firewire-tascam: check intermediate state of clock status and retry scsi: scsi_dh_rdac: zero cdb in send_mode_select() printk: Do not lose last line in kmsg buffer dump IB/hfi1: Define variables as unsigned long to fix KASAN warning randstruct: Check member structs in is_pure_ops_struct() ALSA: hda/realtek - Fixup mute led on HP Spectre x360 fuse: fix missing unlock_page in fuse_writepage() parisc: Disable HP HSC-PCI Cards to prevent kernel crash x86/retpolines: Fix up backport of a9d57ef15cbe KVM: x86: always stop emulation on page fault KVM: x86: set ctxt->have_exception in x86_decode_insn() KVM: x86: Manually calculate reserved bits when loading PDPTRS media: sn9c20x: Add MSI MS-1039 laptop to flip_dmi_table binfmt_elf: Do not move brk for INTERP-less ET_EXEC ASoC: Intel: NHLT: Fix debug print format ASoC: Intel: Skylake: Use correct function to access iomem space ASoC: Intel: Fix use of potentially uninitialized variable ARM: samsung: Fix system restart on S3C6410 ARM: zynq: Use memcpy_toio instead of memcpy on smp bring-up arm64: dts: rockchip: limit clock rate of MMC controllers for RK3328 alarmtimer: Use EOPNOTSUPP instead of ENOTSUPP regulator: Defer init completion for a while after late_initcall gfs2: clear buf_in_tr when ending a transaction in sweep_bh_for_rgrps memcg, oom: don't require __GFP_FS when invoking memcg OOM killer memcg, kmem: do not fail __GFP_NOFAIL charges ovl: filter of trusted xattr results in audit Btrfs: fix use-after-free when using the tree modification log btrfs: Relinquish CPUs in btrfs_compare_trees btrfs: qgroup: Fix the wrong target io_tree when freeing reserved data space md/raid6: Set R5_ReadError when there is read failure on parity disk md: don't report active array_state until after revalidate_disk() completes. md: only call set_in_sync() when it is expected to succeed. cfg80211: Purge frame registrations on iftype change /dev/mem: Bail out upon SIGKILL. ext4: fix warning inside ext4_convert_unwritten_extents_endio ext4: fix punch hole for inline_data file systems quota: fix wrong condition in is_quota_modification() hwrng: core - don't wait on add_early_randomness() i2c: riic: Clear NACK in tend isr CIFS: fix max ea value size CIFS: Fix oplock handling for SMB 2.1+ protocols md/raid0: avoid RAID0 data corruption due to layout confusion. mm/compaction.c: clear total_{migrate,free}_scanned before scanning a new zone btrfs: qgroup: Drop quota_root and fs_info parameters from update_qgroup_status_item Btrfs: fix race setting up and completing qgroup rescan workers Linux 4.14.147 Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>	2019-10-06 12:38:20 +02:00
Yafang Shao	0861bcab4f	mm/compaction.c: clear total_{migrate,free}_scanned before scanning a new zone [ Upstream commit a94b525241c0fff3598809131d7cfcfe1d572d8c ] total_{migrate,free}_scanned will be added to COMPACTMIGRATE_SCANNED and COMPACTFREE_SCANNED in compact_zone(). We should clear them before scanning a new zone. In the proc triggered compaction, we forgot clearing them. [laoar.shao@gmail.com: introduce a helper compact_zone_counters_init()] Link: http://lkml.kernel.org/r/1563869295-25748-1-git-send-email-laoar.shao@gmail.com [akpm@linux-foundation.org: expand compact_zone_counters_init() into its single callsite, per mhocko] [vbabka@suse.cz: squash compact_zone() list_head init as well] Link: http://lkml.kernel.org/r/1fb6f7da-f776-9e42-22f8-bbb79b030b98@suse.cz [akpm@linux-foundation.org: kcompactd_do_work(): avoid unnecessary initialization of cc.zone] Link: http://lkml.kernel.org/r/1563789275-9639-1-git-send-email-laoar.shao@gmail.com Fixes: 7f354a548d1c ("mm, compaction: add vmstats for kcompactd work") Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Yafang Shao <shaoyafang@didiglobal.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>	2019-10-05 12:48:13 +02:00
Blagovest Kolenichev	070370f0ae	Merge android-4.14.108 (4344de2) into msm-4.14 * refs/heads/tmp-4344de2: Linux 4.14.108 s390/setup: fix boot crash for machine without EDAT-1 KVM: nVMX: Ignore limit checks on VMX instructions using flat segments KVM: nVMX: Apply addr size mask to effective address for VMX instructions KVM: nVMX: Sign extend displacements of VMX instr's mem operands KVM: x86/mmu: Do not cache MMIO accesses while memslots are in flux KVM: x86/mmu: Detect MMIO generation wrap in any address space KVM: Call kvm_arch_memslots_updated() before updating memslots drm/radeon/evergreen_cs: fix missing break in switch statement media: imx: csi: Stop upstream before disabling IDMA channel media: imx: csi: Disable CSI immediately after last EOF media: vimc: Add vimc-streamer for stream control media: uvcvideo: Avoid NULL pointer dereference at the end of streaming media: imx: prpencvf: Stop upstream before disabling IDMA channel rcu: Do RCU GP kthread self-wakeup from softirq and interrupt tpm: Unify the send callback behaviour tpm/tpm_crb: Avoid unaligned reads in crb_recv() md: Fix failed allocation of md_register_thread perf intel-pt: Fix divide by zero when TSC is not available perf intel-pt: Fix overlap calculation for padding perf auxtrace: Define auxtrace record alignment perf intel-pt: Fix CYC timestamp calculation after OVF x86/unwind/orc: Fix ORC unwind table alignment bcache: never writeback a discard operation PM / wakeup: Rework wakeup source timer cancellation NFSv4.1: Reinitialise sequence results before retransmitting a request nfsd: fix wrong check in write_v4_end_grace() nfsd: fix memory corruption caused by readdir NFS: Don't recoalesce on error in nfs_pageio_complete_mirror() NFS: Fix an I/O request leakage in nfs_do_recoalesce NFS: Fix I/O request leakages cpcap-charger: generate events for userspace dm integrity: limit the rate of error messages dm: fix to_sector() for 32bit arm64: KVM: Fix architecturally invalid reset value for FPEXC32_EL2 arm64: debug: Ensure debug handlers check triggering exception level arm64: Fix HCR.TGE status for NMI contexts ARM: s3c24xx: Fix boolean expressions in osiris_dvs_notify powerpc/traps: Fix the message printed when stack overflows powerpc/traps: fix recoverability of machine check handling on book3s/32 powerpc/hugetlb: Don't do runtime allocation of 16G pages in LPAR configuration powerpc/ptrace: Simplify vr_get/set() to avoid GCC warning powerpc: Fix 32-bit KVM-PR lockup and host crash with MacOS guest powerpc/83xx: Also save/restore SPRG4-7 during suspend powerpc/powernv: Make opal log only readable by root powerpc/wii: properly disable use of BATs when requested. powerpc/32: Clear on-stack exception marker upon exception return security/selinux: fix SECURITY_LSM_NATIVE_LABELS on reused superblock jbd2: fix compile warning when using JBUFFER_TRACE jbd2: clear dirty flag when revoking a buffer from an older transaction serial: 8250_pci: Have ACCES cards that use the four port Pericom PI7C9X7954 chip use the pci_pericom_setup() serial: 8250_pci: Fix number of ports for ACCES serial cards serial: 8250_of: assume reg-shift of 2 for mrvl,mmp-uart serial: uartps: Fix stuck ISR if RX disabled with non-empty FIFO drm/i915: Relax mmap VMA check crypto: arm64/aes-neonbs - fix returning final keystream block i2c: tegra: fix maximum transfer size parport_pc: fix find_superio io compare code, should use equal test. intel_th: Don't reference unassigned outputs device property: Fix the length used in PROPERTY_ENTRY_STRING() kernel/sysctl.c: add missing range check in do_proc_dointvec_minmax_conv mm/vmalloc: fix size check for remap_vmalloc_range_partial() mm: hwpoison: fix thp split handing in soft_offline_in_use_page() nfit: acpi_nfit_ctl(): Check out_obj->type in the right place usb: chipidea: tegra: Fix missed ci_hdrc_remove_device() clk: ingenic: Fix doc of ingenic_cgu_div_info clk: ingenic: Fix round_rate misbehaving with non-integer dividers clk: clk-twl6040: Fix imprecise external abort for pdmclk clk: uniphier: Fix update register for CPU-gear ext2: Fix underflow in ext2_max_size() cxl: Wrap iterations over afu slices inside 'afu_list_lock' IB/hfi1: Close race condition on user context disable and close ext4: fix crash during online resizing ext4: add mask of ext4 flags to swap cpufreq: pxa2xx: remove incorrect __init annotation cpufreq: tegra124: add missing of_node_put() x86/kprobes: Prohibit probing on optprobe template code irqchip/gic-v3-its: Avoid parsing _indirect_ twice for Device table libertas_tf: don't set URB_ZERO_PACKET on IN USB transfer crypto: pcbc - remove bogus memcpy()s with src == dest Btrfs: fix corruption reading shared and compressed extents after hole punching btrfs: ensure that a DUP or RAID1 block group has exactly two stripes Btrfs: setup a nofs context for memory allocation at __btrfs_set_acl m68k: Add -ffreestanding to CFLAGS splice: don't merge into linked buffers fs/devpts: always delete dcache dentry-s in dput() scsi: target/iscsi: Avoid iscsit_release_commands_from_conn() deadlock scsi: sd: Optimal I/O size should be a multiple of physical block size scsi: aacraid: Fix performance issue on logical drives scsi: virtio_scsi: don't send sc payload with tmfs s390/virtio: handle find on invalid queue gracefully s390/setup: fix early warning messages clocksource/drivers/exynos_mct: Clear timer interrupt when shutdown clocksource/drivers/exynos_mct: Move one-shot check from tick clear to ISR regulator: s2mpa01: Fix step values for some LDOs regulator: max77620: Initialize values for DT properties regulator: s2mps11: Fix steps for buck7, buck8 and LDO35 spi: pxa2xx: Setup maximum supported DMA transfer length spi: ti-qspi: Fix mmap read when more than one CS in use mmc: sdhci-esdhc-imx: fix HS400 timing issue ACPI / device_sysfs: Avoid OF modalias creation for removed device xen: fix dom0 boot on huge systems tracing: Do not free iter->trace in fail path of tracing_open_pipe() tracing: Use strncpy instead of memcpy for string keys in hist triggers CIFS: Fix read after write for files with read caching CIFS: Do not reset lease state to NONE on lease break crypto: arm64/aes-ccm - fix bugs in non-NEON fallback routine crypto: arm64/aes-ccm - fix logical bug in AAD MAC handling crypto: testmgr - skip crc32c context test for ahash algorithms crypto: hash - set CRYPTO_TFM_NEED_KEY if ->setkey() fails crypto: arm64/crct10dif - revert to C code for short inputs crypto: arm/crct10dif - revert to C code for short inputs fix cgroup_do_mount() handling of failure exits libnvdimm: Fix altmap reservation size calculation libnvdimm/pmem: Honor force_raw for legacy pmem regions libnvdimm, pfn: Fix over-trim in trim_pfn_device() libnvdimm/label: Clear 'updating' flag after label-set update stm class: Prevent division by zero media: videobuf2-v4l2: drop WARN_ON in vb2_warn_zero_bytesused() tmpfs: fix uninitialized return value in shmem_link net: set static variable an initial value in atl2_probe() nfp: bpf: fix ALU32 high bits clearance bug nfp: bpf: fix code-gen bug on BPF_ALU \| BPF_XOR \| BPF_K net: thunderx: make CFG_DONE message to run through generic send-ack sequence mac80211_hwsim: propagate genlmsg_reply return code phonet: fix building with clang ARCv2: support manual regfile save on interrupts ARC: uacces: remove lp_start, lp_end from clobber list ARCv2: lib: memcpy: fix doing prefetchw outside of buffer ixgbe: fix older devices that do not support IXGBE_MRQC_L3L4TXSWEN tmpfs: fix link accounting when a tmpfile is linked in net: marvell: mvneta: fix DMA debug warning arm64: Relax GIC version check during early boot qed: Fix iWARP syn packet mac address validation. ASoC: topology: free created components in tplg load error mailbox: bcm-flexrm-mailbox: Fix FlexRM ring flush timeout issue net: mv643xx_eth: disable clk on error path in mv643xx_eth_shared_probe() qmi_wwan: apply SET_DTR quirk to Sierra WP7607 pinctrl: meson: meson8b: fix the sdxc_a data 1..3 pins net: systemport: Fix reception of BPDUs scsi: libiscsi: Fix race between iscsi_xmit_task and iscsi_complete_task keys: Fix dependency loop between construction record and auth key assoc_array: Fix shortcut creation af_key: unconditionally clone on broadcast ARM: 8824/1: fix a migrating irq bug when hotplug cpu esp: Skip TX bytes accounting when sending from a request socket clk: sunxi: A31: Fix wrong AHB gate number clk: sunxi-ng: v3s: Fix TCON reset de-assert bit Input: st-keyscan - fix potential zalloc NULL dereference auxdisplay: ht16k33: fix potential user-after-free on module unload i2c: bcm2835: Clear current buffer pointers and counts after a transfer i2c: cadence: Fix the hold bit setting net: hns: Fix object reference leaks in hns_dsaf_roce_reset() mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs Revert "mm: use early_pfn_to_nid in page_ext_init" mm/gup: fix gup_pmd_range() for dax NFS: Don't use page_file_mapping after removing the page floppy: check_events callback should not return a negative number ipvs: fix dependency on nf_defrag_ipv6 mac80211: Fix Tx aggregation session tear down with ITXQs Input: matrix_keypad - use flush_delayed_work() Input: ps2-gpio - flush TX work when closing port Input: cap11xx - switch to using set_brightness_blocking() ARM: OMAP2+: fix lack of timer interrupts on CPU1 after hotplug KVM: arm/arm64: Reset the VCPU without preemption and vcpu state loaded ASoC: rsnd: fixup rsnd_ssi_master_clk_start() user count check ASoC: dapm: fix out-of-bounds accesses to DAPM lookup tables ARM: OMAP2+: Variable "reg" in function omap4_dsi_mux_pads() could be uninitialized Input: pwm-vibra - stop regulator after disabling pwm, not before Input: pwm-vibra - prevent unbalanced regulator s390/dasd: fix using offset into zero size array error gpu: ipu-v3: Fix CSI offsets for imx53 drm/imx: imx-ldb: add missing of_node_puts gpu: ipu-v3: Fix i.MX51 CSI control registers offset drm/imx: ignore plane updates on disabled crtcs crypto: rockchip - update new iv to device in multiple operations crypto: rockchip - fix scatterlist nents error crypto: ahash - fix another early termination in hash walk crypto: caam - fixed handling of sg list stm class: Fix an endless loop in channel allocation iio: adc: exynos-adc: Fix NULL pointer exception on unbind ASoC: fsl_esai: fix register setting issue in RIGHT_J mode 9p/net: fix memory leak in p9_client_create 9p: use inode->i_lock to protect i_size_write() under 32-bit FROMLIST: psi: introduce psi monitor FROMLIST: refactor header includes to allow kthread.h inclusion in psi_types.h FROMLIST: psi: track changed states FROMLIST: psi: split update_stats into parts FROMLIST: psi: rename psi fields in preparation for psi trigger addition FROMLIST: psi: make psi_enable static FROMLIST: psi: introduce state_mask to represent stalled psi states ANDROID: cuttlefish_defconfig: Enable CONFIG_INPUT_MOUSEDEV ANDROID: cuttlefish_defconfig: Enable CONFIG_PSI BACKPORT: kernel: cgroup: add poll file operation BACKPORT: fs: kernfs: add poll file operation UPSTREAM: psi: avoid divide-by-zero crash inside virtual machines UPSTREAM: psi: clarify the Kconfig text for the default-disable option UPSTREAM: psi: fix aggregation idle shut-off UPSTREAM: psi: fix reference to kernel commandline enable UPSTREAM: psi: make disabling/enabling easier for vendor kernels UPSTREAM: kernel/sched/psi.c: simplify cgroup_move_task() BACKPORT: psi: cgroup support UPSTREAM: psi: pressure stall information for CPU, memory, and IO UPSTREAM: sched: introduce this_rq_lock_irq() UPSTREAM: sched: sched.h: make rq locking and clock functions available in stats.h UPSTREAM: sched: loadavg: make calc_load_n() public BACKPORT: sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOAD UPSTREAM: delayacct: track delays from thrashing cache pages UPSTREAM: mm: workingset: tell cache transitions from workingset thrashing sched/fair: fix energy compute when a cluster is only a cpu core in multi-cluster system Conflicts: arch/arm/kernel/irq.c drivers/scsi/sd.c include/linux/sched.h include/uapi/linux/taskstats.h kernel/sched/Makefile sound/soc/soc-dapm.c Change-Id: I12ebb57a34da9101ee19458d7e1f96ecc769c39a Signed-off-by: Blagovest Kolenichev <bkolenichev@codeaurora.org>	2019-05-15 07:44:57 -07:00
Johannes Weiner	94eab674ff	UPSTREAM: psi: pressure stall information for CPU, memory, and IO When systems are overcommitted and resources become contended, it's hard to tell exactly the impact this has on workload productivity, or how close the system is to lockups and OOM kills. In particular, when machines work multiple jobs concurrently, the impact of overcommit in terms of latency and throughput on the individual job can be enormous. In order to maximize hardware utilization without sacrificing individual job health or risk complete machine lockups, this patch implements a way to quantify resource pressure in the system. A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that expose the percentage of time the system is stalled on CPU, memory, or IO, respectively. Stall states are aggregate versions of the per-task delay accounting delays: cpu: some tasks are runnable but not executing on a CPU memory: tasks are reclaiming, or waiting for swapin or thrashing cache io: tasks are waiting for io completions These percentages of walltime can be thought of as pressure percentages, and they give a general sense of system health and productivity loss incurred by resource overcommit. They can also indicate when the system is approaching lockup scenarios and OOMs. To do this, psi keeps track of the task states associated with each CPU and samples the time they spend in stall states. Every 2 seconds, the samples are averaged across CPUs - weighted by the CPUs' non-idle time to eliminate artifacts from unused CPUs - and translated into percentages of walltime. A running average of those percentages is maintained over 10s, 1m, and 5m periods (similar to the loadaverage). [hannes@cmpxchg.org: doc fixlet, per Randy] Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org [hannes@cmpxchg.org: code optimization] Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org [hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter] Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org [hannes@cmpxchg.org: fix build] Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit eb414681d5a07d28d2ff90dc05f69ec6b232ebd2) Bug: 127712811 Test: lmkd in PSI mode Change-Id: Id00d23c977169b0c4636d92016fc1fee0274be05 Signed-off-by: Suren Baghdasaryan <surenb@google.com>	2019-03-21 16:19:55 -07:00
Vinayak Menon	6093c334a7	mm: compaction: fix the page state calculation in too_many_isolated Commit "mm: vmscan: fix the page state calculation in too_many_isolated" fixed an issue where a number of tasks were blocked in reclaim path for seconds, because of vmstat_diff not being synced in time. A similar problem can happen in isolate_migratepages_block, where similar calculation is performed. This patch fixes that. Change-Id: Ie74f108ef770da688017b515fe37faea6f384589 Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>	2018-05-10 23:32:47 -07:00
Greg Kroah-Hartman	b24413180f	License cleanup: add SPDX GPL-2.0 license identifier to files with no license Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a /uapi/ one with no licensing information in it, - file was a /uapi/ one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non /uapi/ files that summary was: SPDX license identifier # files ---------------------------------------------------\|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a /uapi/ path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------\|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the /uapi/ ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------\|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2017-11-02 11:10:55 +01:00
Davidlohr Bueso	6818600ff0	mm,compaction: serialize waitqueue_active() checks (for real) Andrea brought to my attention that the L->{L,S} guarantees are completely bogus for this case. I was looking at the diagram, from the offending commit, when that _is_ the race, we had the load reordered already. What we need is at least S->L semantics, thus simply use wq_has_sleeper() to serialize the call for good. Link: http://lkml.kernel.org/r/20170914175313.GB811@linux-80c1.suse Fixes: 46acef048a6 (mm,compaction: serialize waitqueue_active() checks) Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Reported-by: Andrea Parri <parri.andrea@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-10-03 17:54:24 -07:00
Michal Hocko	ccbe1e4dde	mm, compaction: skip over holes in __reset_isolation_suitable __reset_isolation_suitable walks the whole zone pfn range and it tries to jump over holes by checking the zone for each page. It might still stumble over offline pages, though. Skip those by checking pfn_to_online_page() Link: http://lkml.kernel.org/r/20170515085827.16474-9-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Daniel Kiper <daniel.kiper@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Reza Arbab <arbab@linux.vnet.ibm.com> Cc: Tobias Regnery <tobias.regnery@gmail.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-07-06 16:24:32 -07:00
Vlastimil Babka	baf6a9a1db	mm, compaction: finish whole pageblock to reduce fragmentation The main goal of direct compaction is to form a high-order page for allocation, but it should also help against long-term fragmentation when possible. Most lower-than-pageblock-order compactions are for non-movable allocations, which means that if we compact in a movable pageblock and terminate as soon as we create the high-order page, it's unlikely that the fallback heuristics will claim the whole block. Instead there might be a single unmovable page in a pageblock full of movable pages, and the next unmovable allocation might pick another pageblock and increase long-term fragmentation. To help against such scenarios, this patch changes the termination criteria for compaction so that the current pageblock is finished even though the high-order page already exists. Note that it might be possible that the high-order page formed elsewhere in the zone due to parallel activity, but this patch doesn't try to detect that. This is only done with sync compaction, because async compaction is limited to pageblock of the same migratetype, where it cannot result in a migratetype fallback. (Async compaction also eagerly skips order-aligned blocks where isolation fails, which is against the goal of migrating away as much of the pageblock as possible.) As a result of this patch, long-term memory fragmentation should be reduced. In testing based on 4.9 kernel with stress-highalloc from mmtests configured for order-4 GFP_KERNEL allocations, this patch has reduced the number of unmovable allocations falling back to movable pageblocks by 20%. The number Link: http://lkml.kernel.org/r/20170307131545.28577-9-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-05-08 17:15:10 -07:00
Vlastimil Babka	282722b0d2	mm, compaction: restrict async compaction to pageblocks of same migratetype The migrate scanner in async compaction is currently limited to MIGRATE_MOVABLE pageblocks. This is a heuristic intended to reduce latency, based on the assumption that non-MOVABLE pageblocks are unlikely to contain movable pages. However, with the exception of THP's, most high-order allocations are not movable. Should the async compaction succeed, this increases the chance that the non-MOVABLE allocations will fallback to a MOVABLE pageblock, making the long-term fragmentation worse. This patch attempts to help the situation by changing async direct compaction so that the migrate scanner only scans the pageblocks of the requested migratetype. If it's a non-MOVABLE type and there are such pageblocks that do contain movable pages, chances are that the allocation can succeed within one of such pageblocks, removing the need for a fallback. If that fails, the subsequent sync attempt will ignore this restriction. In testing based on 4.9 kernel with stress-highalloc from mmtests configured for order-4 GFP_KERNEL allocations, this patch has reduced the number of unmovable allocations falling back to movable pageblocks by 30%. The number of movable allocations falling back is reduced by 12%. Link: http://lkml.kernel.org/r/20170307131545.28577-8-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-05-08 17:15:10 -07:00
Vlastimil Babka	d39773a062	mm, compaction: add migratetype to compact_control Preparation patch. We are going to need migratetype at lower layers than compact_zone() and compact_finished(). Link: http://lkml.kernel.org/r/20170307131545.28577-7-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-05-08 17:15:10 -07:00
Vlastimil Babka	b682debd97	mm, compaction: change migrate_async_suitable() to suitable_migration_source() Preparation for making the decisions more complex and depending on compact_control flags. No functional change. Link: http://lkml.kernel.org/r/20170307131545.28577-6-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-05-08 17:15:10 -07:00
Vlastimil Babka	228d7e3390	mm, compaction: remove redundant watermark check in compact_finished() When detecting whether compaction has succeeded in forming a high-order page, __compact_finished() employs a watermark check, followed by an own search for a suitable page in the freelists. This is not ideal for two reasons: - The watermark check also searches high-order freelists, but has a less strict criteria wrt fallback. It's therefore redundant and waste of cycles. This was different in the past when high-order watermark check attempted to apply reserves to high-order pages. - The watermark check might actually fail due to lack of order-0 pages. Compaction can't help with that, so there's no point in continuing because of that. It's possible that high-order page still exists and it terminates. This patch therefore removes the watermark check. This should save some cycles and terminate compaction sooner in some cases. Link: http://lkml.kernel.org/r/20170307131545.28577-3-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-05-08 17:15:09 -07:00
Yisheng Xie	1ef36db2a9	mm/compaction: ignore block suitable after check large free page By reviewing code, I find that if the migrate target is a large free page and we ignore suitable, it may splite large target free page into smaller block which is not good for defrag. So move the ignore block suitable after check large free page. As Vlastimil pointed out in RFC version that this patch is just based on logical analyses which might be better for future-proofing the function and it is most likely won't have any visible effect right now, for direct compaction shouldn't have to be called if there's a >=pageblock_order page already available. Link: http://lkml.kernel.org/r/1489490743-5364-1-git-send-email-xieyisheng1@huawei.com Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Hanjun Guo <guohanjun@huawei.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-05-03 15:52:09 -07:00
Ingo Molnar	174cd4b1e5	sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h> Fix up affected files that include this signal functionality via sched.h. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-03-02 08:42:32 +01:00
Yisheng Xie	9e5bcd610f	mm/migration: make isolate_movable_page() return int type Patch series "HWPOISON: soft offlining for non-lru movable page", v6. After Minchan's commit bda807d44454 ("mm: migrate: support non-lru movable page migration"), some type of non-lru page like zsmalloc and virtio-balloon page also support migration. Therefore, we can: 1) soft offlining no-lru movable pages, which means when memory corrected errors occur on a non-lru movable page, we can stop to use it by migrating data onto another page and disable the original (maybe half-broken) one. 2) enable memory hotplug for non-lru movable pages, i.e. we may offline blocks, which include such pages, by using non-lru page migration. This patchset is heavily dependent on non-lru movable page migration. This patch (of 4): Change the return type of isolate_movable_page() from bool to int. It will return 0 when isolate movable page successfully, and return -EBUSY when it isolates failed. There is no functional change within this patch but prepare for later patch. [xieyisheng1@huawei.com: v6] Link: http://lkml.kernel.org/r/1486108770-630-2-git-send-email-xieyisheng1@huawei.com Link: http://lkml.kernel.org/r/1485867981-16037-2-git-send-email-ysxie@foxmail.com Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com> Suggested-by: Michal Hocko <mhocko@kernel.org> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Hanjun Guo <guohanjun@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Reza Arbab <arbab@linux.vnet.ibm.com> Cc: Taku Izumi <izumi.taku@jp.fujitsu.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xishi Qiu <qiuxishi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-24 17:46:55 -08:00
Davidlohr Bueso	46acef048a	mm,compaction: serialize waitqueue_active() checks Without a memory barrier, the following race can occur with a high-order allocation: wakeup_kcompactd(order == 1) kcompactd() [L] waitqueue_active(kcompactd_wait) [S] prepare_to_wait_event(kcompactd_wait) [L] (kcompactd_max_order == 0) [S] kcompactd_max_order = order; schedule() Where the waitqueue_active() check is speculatively re-ordered to before setting the actual condition (max_order), not seeing the threads that's going to block; making us miss a wakeup. There are a couple of options to fix this, including calling wq_has_sleepers() which adds a full barrier, or unconditionally doing the wake_up_interruptible() and serialize on the q->lock. However, to make use of the control dependency, we just need to add L->L guarantees. While this bug is theoretical, there have been other offenders of the lockless waitqueue_active() in the past -- this is also documented in the call itself. Link: http://lkml.kernel.org/r/1483975528-24342-1-git-send-email-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-22 16:41:29 -08:00
David Rientjes	7f354a548d	mm, compaction: add vmstats for kcompactd work A "compact_daemon_wake" vmstat exists that represents the number of times kcompactd has woken up. This doesn't represent how much work it actually did, though. It's useful to understand how much compaction work is being done by kcompactd versus other methods such as direct compaction and explicitly triggered per-node (or system) compaction. This adds two new vmstats: "compact_daemon_migrate_scanned" and "compact_daemon_free_scanned" to represent the number of pages kcompactd has scanned as part of its migration scanner and freeing scanner, respectively. These values are still accounted for in the general "compact_migrate_scanned" and "compact_free_scanned" for compatibility. It could be argued that explicitly triggered compaction could also be tracked separately, and that could be added if others find it useful. Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1612071749390.69852@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-02-22 16:41:29 -08:00
Michal Hocko	73e64c51af	mm, compaction: allow compaction for GFP_NOFS requests compaction has been disabled for GFP_NOFS and GFP_NOIO requests since the direct compaction was introduced by commit 56de7263fcf3 ("mm: compaction: direct compact when a high-order allocation fails"). The main reason is that the migration of page cache pages might recurse back to fs/io layer and we could potentially deadlock. This is overly conservative because all the anonymous memory is migrateable in the GFP_NOFS context just fine. This might be a large portion of the memory in many/most workkloads. Remove the GFP_NOFS restriction and make sure that we skip all fs pages (those with a mapping) while isolating pages to be migrated. We cannot consider clean fs pages because they might need a metadata update so only isolate pages without any mapping for nofs requests. The effect of this patch will be probably very limited in many/most workloads because higher order GFP_NOFS requests are quite rare, although different configurations might lead to very different results. David Chinner has mentioned a heavy metadata workload with 64kB block which to quote him: : Unfortunately, there was an era of cargo cult configuration tweaks in the : Ceph community that has resulted in a large number of production machines : with XFS filesystems configured this way. And a lot of them store large : numbers of small files and run under significant sustained memory : pressure. : : I slowly working towards getting rid of these high order allocations and : replacing them with the equivalent number of single page allocations, but : I haven't got that (complex) change working yet. We can do the following to simulate that workload: $ mkfs.xfs -f -n size=64k <dev> $ mount <dev> /mnt/scratch $ time ./fs_mark -D 10000 -S0 -n 100000 -s 0 -L 32 \ -d /mnt/scratch/0 -d /mnt/scratch/1 \ -d /mnt/scratch/2 -d /mnt/scratch/3 \ -d /mnt/scratch/4 -d /mnt/scratch/5 \ -d /mnt/scratch/6 -d /mnt/scratch/7 \ -d /mnt/scratch/8 -d /mnt/scratch/9 \ -d /mnt/scratch/10 -d /mnt/scratch/11 \ -d /mnt/scratch/12 -d /mnt/scratch/13 \ -d /mnt/scratch/14 -d /mnt/scratch/15 and indeed is hammers the system with many high order GFP_NOFS requests as per a simle tracepoint during the load: $ echo '!(gfp_flags & 0x80) && (gfp_flags &0x400000)' > $TRACE_MNT/events/kmem/mm_page_alloc/filter I am getting 5287609 order=0 37 order=1 1594905 order=2 3048439 order=3 6699207 order=4 66645 order=5 My testing was done in a kvm guest so performance numbers should be taken with a grain of salt but there seems to be a difference when the patch is applied: * Original kernel FSUse% Count Size Files/sec App Overhead 1 1600000 0 4300.1 20745838 3 3200000 0 4239.9 23849857 5 4800000 0 4243.4 25939543 6 6400000 0 4248.4 19514050 8 8000000 0 4262.1 20796169 9 9600000 0 4257.6 21288675 11 11200000 0 4259.7 19375120 13 12800000 0 4220.7 22734141 14 14400000 0 4238.5 31936458 16 16000000 0 4231.5 23409901 18 17600000 0 4045.3 23577700 19 19200000 0 2783.4 58299526 21 20800000 0 2678.2 40616302 23 22400000 0 2693.5 83973996 and xfs complaining about memory allocation not making progress [ 2304.372647] XFS: fs_mark(3289) possible memory allocation deadlock size 65624 in kmem_alloc (mode:0x2408240) [ 2304.443323] XFS: fs_mark(3285) possible memory allocation deadlock size 65728 in kmem_alloc (mode:0x2408240) [ 4796.772477] XFS: fs_mark(3424) possible memory allocation deadlock size 46936 in kmem_alloc (mode:0x2408240) [ 4796.775329] XFS: fs_mark(3423) possible memory allocation deadlock size 51416 in kmem_alloc (mode:0x2408240) [ 4797.388808] XFS: fs_mark(3424) possible memory allocation deadlock size 65728 in kmem_alloc (mode:0x2408240) * Patched kernel FSUse% Count Size Files/sec App Overhead 1 1600000 0 4289.1 19243934 3 3200000 0 4241.6 32828865 5 4800000 0 4248.7 32884693 6 6400000 0 4314.4 19608921 8 8000000 0 4269.9 24953292 9 9600000 0 4270.7 33235572 11 11200000 0 4346.4 40817101 13 12800000 0 4285.3 29972397 14 14400000 0 4297.2 20539765 16 16000000 0 4219.6 18596767 18 17600000 0 4273.8 49611187 19 19200000 0 4300.4 27944451 21 20800000 0 4270.6 22324585 22 22400000 0 4317.6 22650382 24 24000000 0 4065.2 22297964 So the dropdown at Count 19200000 didn't happen and there was only a single warning about allocation not making progress [ 3063.815003] XFS: fs_mark(3272) possible memory allocation deadlock size 65624 in kmem_alloc (mode:0x2408240) This suggests that the patch has helped even though there is not all that much of anonymous memory as the workload mostly generates fs metadata. I assume the success rate would be higher with more anonymous memory which should be the case in many workloads. [akpm@linux-foundation.org: fix comment] Link: http://lkml.kernel.org/r/20161012114721.31853-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-12-14 16:04:07 -08:00
Linus Torvalds	e34bac726d	Merge branch 'akpm' (patches from Andrew) Merge updates from Andrew Morton: - various misc bits - most of MM (quite a lot of MM material is awaiting the merge of linux-next dependencies) - kasan - printk updates - procfs updates - MAINTAINERS - /lib updates - checkpatch updates * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (123 commits) init: reduce rootwait polling interval time to 5ms binfmt_elf: use vmalloc() for allocation of vma_filesz checkpatch: don't emit unified-diff error for rename-only patches checkpatch: don't check c99 types like uint8_t under tools checkpatch: avoid multiple line dereferences checkpatch: don't check .pl files, improve absolute path commit log test scripts/checkpatch.pl: fix spelling checkpatch: don't try to get maintained status when --no-tree is given lib/ida: document locking requirements a bit better lib/rbtree.c: fix typo in comment of ____rb_erase_color lib/Kconfig.debug: make CONFIG_STRICT_DEVMEM depend on CONFIG_DEVMEM MAINTAINERS: add drm and drm/i915 irc channels MAINTAINERS: add "C:" for URI for chat where developers hang out MAINTAINERS: add drm and drm/i915 bug filing info MAINTAINERS: add "B:" for URI where to file bugs get_maintainer: look for arbitrary letter prefixes in sections printk: add Kconfig option to set default console loglevel printk/sound: handle more message headers printk/btrfs: handle more message headers printk/kdb: handle more message headers ...	2016-12-12 20:50:02 -08:00
Ming Ling	6afcf8ef0c	mm, compaction: fix NR_ISOLATED_* stats for pfn based migration Since commit bda807d44454 ("mm: migrate: support non-lru movable page migration") isolate_migratepages_block) can isolate !PageLRU pages which would acct_isolated account as NR_ISOLATED_. Accounting these non-lru pages NR_ISOLATED_{ANON,FILE} doesn't make any sense and it can misguide heuristics based on those counters such as pgdat_reclaimable_pages resp. too_many_isolated which would lead to unexpected stalls during the direct reclaim without any good reason. Note that __alloc_contig_migrate_range can isolate a lot of pages at once. On mobile devices such as 512M ram android Phone, it may use a big zram swap. In some cases zram(zsmalloc) uses too many non-lru but migratedable pages, such as: MemTotal: 468148 kB Normal free:5620kB Free swap:4736kB Total swap:409596kB ZRAM: 164616kB(zsmalloc non-lru pages) active_anon:60700kB inactive_anon:60744kB active_file:34420kB inactive_file:37532kB Fix this by only accounting lru pages to NR_ISOLATED_ in isolate_migratepages_block right after they were isolated and we still know they were on LRU. Drop acct_isolated because it is called after the fact and we've lost that information. Batching per-cpu counter doesn't make much improvement anyway. Also make sure that we uncharge only LRU pages when putting them back on the LRU in putback_movable_pages resp. when unmap_and_move migrates the page. [mhocko@suse.com: replace acct_isolated() with direct counting] Fixes: bda807d44454 ("mm: migrate: support non-lru movable page migration") Link: http://lkml.kernel.org/r/20161019080240.9682-1-mhocko@kernel.org Signed-off-by: Ming Ling <ming.ling@spreadtrum.com> Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-12-12 18:55:07 -08:00
Anna-Maria Gleixner	e46b1db249	mm/compaction: Convert to hotplug state machine Install the callbacks via the state machine. Should the hotplug init fail then no threads are spawned. Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: linux-mm@kvack.org Cc: rt@linutronix.de Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/20161126231350.10321-15-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2016-12-02 00:52:37 +01:00
Vlastimil Babka	2031142028	mm, compaction: restrict fragindex to costly orders Fragmentation index and the vm.extfrag_threshold sysctl is meant as a heuristic to prevent excessive compaction for costly orders (i.e. THP). It's unlikely to make any difference for non-costly orders, especially with the default threshold. But we cannot afford any uncertainty for the non-costly orders where the only alternative to successful reclaim/compaction is OOM. After the recent patches we are guaranteed maximum effort without heuristics from compaction before deciding OOM, and fragindex is the last remaining heuristic. Therefore skip fragindex altogether for non-costly orders. Suggested-by: Michal Hocko <mhocko@suse.com> Link: http://lkml.kernel.org/r/20160926162025.21555-5-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:29 -07:00
Vlastimil Babka	cc5c9f098f	mm, compaction: ignore fragindex from compaction_zonelist_suitable() The compaction_zonelist_suitable() function tries to determine if compaction will be able to proceed after sufficient reclaim, i.e. whether there are enough reclaimable pages to provide enough order-0 freepages for compaction. This addition of reclaimable pages to the free pages works well for the order-0 watermark check, but in the fragmentation index check we only consider truly free pages. Thus we can get fragindex value close to 0 which indicates failure do to lack of memory, and wrongly decide that compaction won't be suitable even after reclaim. Instead of trying to somehow adjust fragindex for reclaimable pages, let's just skip it from compaction_zonelist_suitable(). Link: http://lkml.kernel.org/r/20160926162025.21555-4-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:29 -07:00
Vlastimil Babka	9f7e338793	mm, compaction: make full priority ignore pageblock suitability Several people have reported premature OOMs for order-2 allocations (stack) due to OOM rework in 4.7. In the scenario (parallel kernel build and dd writing to two drives) many pageblocks get marked as Unmovable and compaction free scanner struggles to isolate free pages. Joonsoo Kim pointed out that the free scanner skips pageblocks that are not movable to prevent filling them and forcing non-movable allocations to fallback to other pageblocks. Such heuristic makes sense to help prevent long-term fragmentation, but premature OOMs are relatively more urgent problem. As a compromise, this patch disables the heuristic only for the ultimate compaction priority. Link: http://lkml.kernel.org/r/20160906135258.18335-5-vbabka@suse.cz Reported-by: Ralf-Peter Rohbeck <Ralf-Peter.Rohbeck@quantum.com> Reported-by: Arkadiusz Miskiewicz <a.miskiewicz@gmail.com> Reported-by: Olaf Hering <olaf@aepfle.de> Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:29 -07:00
Vlastimil Babka	8348faf91f	mm, compaction: require only min watermarks for non-costly orders The __compaction_suitable() function checks the low watermark plus a compact_gap() gap to decide if there's enough free memory to perform compaction. Then __isolate_free_page uses low watermark check to decide if particular free page can be isolated. In the latter case, using low watermark is needlessly pessimistic, as the free page isolations are only temporary. For __compaction_suitable() the higher watermark makes sense for high-order allocations where more freepages increase the chance of success, and we can typically fail with some order-0 fallback when the system is struggling to reach that watermark. But for low-order allocation, forming the page should not be that hard. So using low watermark here might just prevent compaction from even trying, and eventually lead to OOM killer even if we are above min watermarks. So after this patch, we use min watermark for non-costly orders in __compaction_suitable(), and for all orders in __isolate_free_page(). [vbabka@suse.cz: clarify __isolate_free_page() comment] Link: http://lkml.kernel.org/r/7ae4baec-4eca-e70b-2a69-94bea4fb19fa@suse.cz Link: http://lkml.kernel.org/r/20160810091226.6709-11-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Lorenzo Stoakes <lstoakes@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:27 -07:00
Vlastimil Babka	984fdba6a3	mm, compaction: use proper alloc_flags in __compaction_suitable() The __compaction_suitable() function checks the low watermark plus a compact_gap() gap to decide if there's enough free memory to perform compaction. This check uses direct compactor's alloc_flags, but that's wrong, since these flags are not applicable for freepage isolation. For example, alloc_flags may indicate access to memory reserves, making compaction proceed, and then fail watermark check during the isolation. A similar problem exists for ALLOC_CMA, which may be part of alloc_flags, but not during freepage isolation. In this case however it makes sense to use ALLOC_CMA both in __compaction_suitable() and __isolate_free_page(), since there's actually nothing preventing the freepage scanner to isolate from CMA pageblocks, with the assumption that a page that could be migrated once by compaction can be migrated also later by CMA allocation. Thus we should count pages in CMA pageblocks when considering compaction suitability and when isolating freepages. To sum up, this patch should remove some false positives from __compaction_suitable(), and allow compaction to proceed when free pages required for compaction reside in the CMA pageblocks. Link: http://lkml.kernel.org/r/20160810091226.6709-10-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:27 -07:00
Vlastimil Babka	9861a62c33	mm, compaction: create compact_gap wrapper Compaction uses a watermark gap of (2UL << order) pages at various places and it's not immediately obvious why. Abstract it through a compact_gap() wrapper to create a single place with a thorough explanation. [vbabka@suse.cz: clarify the comment of compact_gap()] Link: http://lkml.kernel.org/r/7b6aed1f-fdf8-2063-9ff4-bbe4de712d37@suse.cz Link: http://lkml.kernel.org/r/20160810091226.6709-9-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:27 -07:00
Vlastimil Babka	f2b8228c5f	mm, compaction: use correct watermark when checking compaction success The __compact_finished() function uses low watermark in a check that has to pass if the direct compaction is to finish and allocation should succeed. This is too pessimistic, as the allocation will typically use min watermark. It may happen that during compaction, we drop below the low watermark (due to parallel activity), but still form the target high-order page. By checking against low watermark, we might needlessly continue compaction. Similarly, __compaction_suitable() uses low watermark in a check whether allocation can succeed without compaction. Again, this is unnecessarily pessimistic. After this patch, these check will use direct compactor's alloc_flags to determine the watermark, which is effectively the min watermark. Link: http://lkml.kernel.org/r/20160810091226.6709-8-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:27 -07:00
Vlastimil Babka	a8e025e55b	mm, compaction: add the ultimate direct compaction priority During reclaim/compaction loop, it's desirable to get a final answer from unsuccessful compaction so we can either fail the allocation or invoke the OOM killer. However, heuristics such as deferred compaction or pageblock skip bits can cause compaction to skip parts or whole zones and lead to premature OOM's, failures or excessive reclaim/compaction retries. To remedy this, we introduce a new direct compaction priority called COMPACT_PRIO_SYNC_FULL, which instructs direct compaction to: - ignore deferred compaction status for a zone - ignore pageblock skip hints - ignore cached scanner positions and scan the whole zone The new priority should get eventually picked up by should_compact_retry() and this should improve success rates for costly allocations using __GFP_REPEAT, such as hugetlbfs allocations, and reduce some corner-case OOM's for non-costly allocations. Link: http://lkml.kernel.org/r/20160810091226.6709-6-vbabka@suse.cz [vbabka@suse.cz: use the MIN_COMPACT_PRIORITY alias] Link: http://lkml.kernel.org/r/d443b884-87e7-1c93-8684-3a3a35759fb1@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:27 -07:00
Vlastimil Babka	7ceb009a22	mm, compaction: don't recheck watermarks after COMPACT_SUCCESS Joonsoo has reminded me that in a later patch changing watermark checks throughout compaction I forgot to update checks in try_to_compact_pages() and compactd_do_work(). Closer inspection however shows that they are redundant now in the success case, because compact_zone() now reliably reports this with COMPACT_SUCCESS. So effectively the checks just repeat (a subset) of checks that have just passed. So instead of checking watermarks again, just test the return value. Note it's also possible that compaction would declare failure e.g. because its find_suitable_fallback() is more strict than simple watermark check, and then the watermark check we are removing would then still succeed. After this patch this is not possible and it's arguably better, because for long-term fragmentation avoidance we should rather try a different zone than allocate with the unsuitable fallback. If compaction of all zones fail and the allocation is important enough, it will retry and succeed anyway. Also remove the stray "bool success" variable from kcompactd_do_work(). Link: http://lkml.kernel.org/r/20160810091226.6709-5-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Tested-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Michal Hocko <mhocko@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:27 -07:00
Vlastimil Babka	cf378319d3	mm, compaction: rename COMPACT_PARTIAL to COMPACT_SUCCESS COMPACT_PARTIAL has historically meant that compaction returned after doing some work without fully compacting a zone. It however didn't distinguish if compaction terminated because it succeeded in creating the requested high-order page. This has changed recently and now we only return COMPACT_PARTIAL when compaction thinks it succeeded, or the high-order watermark check in compaction_suitable() passes and no compaction needs to be done. So at this point we can make the return value clearer by renaming it to COMPACT_SUCCESS. The next patch will remove some redundant tests for success where compaction just returned COMPACT_SUCCESS. Link: http://lkml.kernel.org/r/20160810091226.6709-4-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:27 -07:00
Vlastimil Babka	791cae9620	mm, compaction: cleanup unused functions Since kswapd compaction moved to kcompactd, compact_pgdat() is not called anymore, so we remove it. The only caller of __compact_pgdat() is compact_node(), so we merge them and remove code that was only reachable from kswapd. Link: http://lkml.kernel.org/r/20160810091226.6709-3-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:27 -07:00
Vlastimil Babka	06ed29989f	mm, compaction: make whole_zone flag ignore cached scanner positions Patch series "make direct compaction more deterministic") This is mostly a followup to Michal's oom detection rework, which highlighted the need for direct compaction to provide better feedback in reclaim/compaction loop, so that it can reliably recognize when compaction cannot make further progress, and allocation should invoke OOM killer or fail. We've discussed this at LSF/MM [1] where I proposed expanding the async/sync migration mode used in compaction to more general "priorities". This patchset adds one new priority that just overrides all the heuristics and makes compaction fully scan all zones. I don't currently think that we need more fine-grained priorities, but we'll see. Other than that there's some smaller fixes and cleanups, mainly related to the THP-specific hacks. I've tested this with stress-highalloc in GFP_KERNEL order-4 and THP-like order-9 scenarios. There's some improvement for compaction stats for the order-4, which is likely due to the better watermarks handling. In the previous version I reported mostly noise wrt compaction stats, and decreased direct reclaim - now the reclaim is without difference. I believe this is due to the less aggressive compaction priority increase in patch 6. "before" is a mmotm tree prior to 4.7 release plus the first part of the series that was sent and merged separately before after order-4: Compaction stalls 27216 30759 Compaction success 19598 25475 Compaction failures 7617 5283 Page migrate success 370510 464919 Page migrate failure 25712 27987 Compaction pages isolated 849601 1041581 Compaction migrate scanned 143146541 101084990 Compaction free scanned 208355124 144863510 Compaction cost 1403 1210 order-9: Compaction stalls 7311 7401 Compaction success 1634 1683 Compaction failures 5677 5718 Page migrate success 194657 183988 Page migrate failure 4753 4170 Compaction pages isolated 498790 456130 Compaction migrate scanned 565371 524174 Compaction free scanned 4230296 4250744 Compaction cost 215 203 [1] https://lwn.net/Articles/684611/ This patch (of 11): A recent patch has added whole_zone flag that compaction sets when scanning starts from the zone boundary, in order to report that zone has been fully scanned in one attempt. For allocations that want to try really hard or cannot fail, we will want to introduce a mode where scanning whole zone is guaranteed regardless of the cached positions. This patch reuses the whole_zone flag in a way that if it's already passed true to compaction, the cached scanner positions are ignored. Employing this flag during reclaim/compaction loop will be done in the next patch. This patch however converts compaction invoked from userspace via procfs to use this flag. Before this patch, the cached positions were first reset to zone boundaries and then read back from struct zone, so there was a window where a parallel compaction could replace the reset values, making the manual compaction less effective. Using the flag instead of performing reset is more robust. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20160810091226.6709-2-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-10-07 18:46:27 -07:00
Vlastimil Babka	c3486f5376	mm, compaction: simplify contended compaction handling Async compaction detects contention either due to failing trylock on zone->lock or lru_lock, or by need_resched(). Since 1f9efdef4f3f ("mm, compaction: khugepaged should not give up due to need_resched()") the code got quite complicated to distinguish these two up to the __alloc_pages_slowpath() level, so different decisions could be taken for khugepaged allocations. After the recent changes, khugepaged allocations don't check for contended compaction anymore, so we again don't need to distinguish lock and sched contention, and simplify the current convoluted code a lot. However, I believe it's also possible to simplify even more and completely remove the check for contended compaction after the initial async compaction for costly orders, which was originally aimed at THP page fault allocations. There are several reasons why this can be done now: - with the new defaults, THP page faults no longer do reclaim/compaction at all, unless the system admin has overridden the default, or application has indicated via madvise that it can benefit from THP's. In both cases, it means that the potential extra latency is expected and worth the benefits. - even if reclaim/compaction proceeds after this patch where it previously wouldn't, the second compaction attempt is still async and will detect the contention and back off, if the contention persists - there are still heuristics like deferred compaction and pageblock skip bits in place that prevent excessive THP page fault latencies Link: http://lkml.kernel.org/r/20160721073614.24395-9-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-07-28 16:07:41 -07:00
Vlastimil Babka	a5508cd83f	mm, compaction: introduce direct compaction priority In the context of direct compaction, for some types of allocations we would like the compaction to either succeed or definitely fail while trying as hard as possible. Current async/sync_light migration mode is insufficient, as there are heuristics such as caching scanner positions, marking pageblocks as unsuitable or deferring compaction for a zone. At least the final compaction attempt should be able to override these heuristics. To communicate how hard compaction should try, we replace migration mode with a new enum compact_priority and change the relevant function signatures. In compact_zone_order() where struct compact_control is constructed, the priority is mapped to suitable control flags. This patch itself has no functional change, as the current priority levels are mapped back to the same migration modes as before. Expanding them will be done next. Note that !CONFIG_COMPACTION variant of try_to_compact_pages() is removed, as the only caller exists under CONFIG_COMPACTION. Link: http://lkml.kernel.org/r/20160721073614.24395-8-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-07-28 16:07:41 -07:00
Hugh Dickins	1d2047fefa	mm, compaction: don't isolate PageWriteback pages in MIGRATE_SYNC_LIGHT mode At present MIGRATE_SYNC_LIGHT is allowing __isolate_lru_page() to isolate a PageWriteback page, which __unmap_and_move() then rejects with -EBUSY: of course the writeback might complete in between, but that's not what we usually expect, so probably better not to isolate it. When tested by stress-highalloc from mmtests, this has reduced the number of page migrate failures by 60-70%. Link: http://lkml.kernel.org/r/20160721073614.24395-2-vbabka@suse.cz Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-07-28 16:07:41 -07:00
Mel Gorman	5a1c84b404	mm: remove reclaim and compaction retry approximations If per-zone LRU accounting is available then there is no point approximating whether reclaim and compaction should retry based on pgdat statistics. This is effectively a revert of "mm, vmstat: remove zone and node double accounting by approximating retries" with the difference that inactive/active stats are still available. This preserves the history of why the approximation was retried and why it had to be reverted to handle OOM kills on 32-bit systems. Link: http://lkml.kernel.org/r/1469110261-7365-4-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-07-28 16:07:41 -07:00
Mel Gorman	bca6759258	mm, vmstat: remove zone and node double accounting by approximating retries The number of LRU pages, dirty pages and writeback pages must be accounted for on both zones and nodes because of the reclaim retry logic, compaction retry logic and highmem calculations all depending on per-zone stats. Many lowmem allocations are immune from OOM kill due to a check in __alloc_pages_may_oom for (ac->high_zoneidx < ZONE_NORMAL) since commit 03668b3ceb0c ("oom: avoid oom killer for lowmem allocations"). The exception is costly high-order allocations or allocations that cannot fail. If the __alloc_pages_may_oom avoids OOM-kill for low-order lowmem allocations then it would fall through to __alloc_pages_direct_compact. This patch will blindly retry reclaim for zone-constrained allocations in should_reclaim_retry up to MAX_RECLAIM_RETRIES. This is not ideal but without per-zone stats there are not many alternatives. The impact it that zone-constrained allocations may delay before considering the OOM killer. As there is no guarantee enough memory can ever be freed to satisfy compaction, this patch avoids retrying compaction for zone-contrained allocations. In combination, that means that the per-node stats can be used when deciding whether to continue reclaim using a rough approximation. While it is possible this will make the wrong decision on occasion, it will not infinite loop as the number of reclaim attempts is capped by MAX_RECLAIM_RETRIES. The final step is calculating the number of dirtyable highmem pages. As those calculations only care about the global count of file pages in highmem. This patch uses a global counter used instead of per-zone stats as it is sufficient. In combination, this allows the per-zone LRU and dirty state counters to be removed. [mgorman@techsingularity.net: fix acct_highmem_file_pages()] Link: http://lkml.kernel.org/r/1468853426-12858-4-git-send-email-mgorman@techsingularity.netLink: http://lkml.kernel.org/r/1467970510-21195-35-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Suggested by: Michal Hocko <mhocko@kernel.org> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-07-28 16:07:41 -07:00
Mel Gorman	599d0c954f	mm, vmscan: move LRU lists to node This moves the LRU lists from the zone to the node and related data such as counters, tracing, congestion tracking and writeback tracking. Unfortunately, due to reclaim and compaction retry logic, it is necessary to account for the number of LRU pages on both zone and node logic. Most reclaim logic is based on the node counters but the retry logic uses the zone counters which do not distinguish inactive and active sizes. It would be possible to leave the LRU counters on a per-zone basis but it's a heavier calculation across multiple cache lines that is much more frequent than the retry checks. Other than the LRU counters, this is mostly a mechanical patch but note that it introduces a number of anomalies. For example, the scans are per-zone but using per-node counters. We also mark a node as congested when a zone is congested. This causes weird problems that are fixed later but is easier to review. In the event that there is excessive overhead on 32-bit systems due to the nodes being on LRU then there are two potential solutions 1. Long-term isolation of highmem pages when reclaim is lowmem When pages are skipped, they are immediately added back onto the LRU list. If lowmem reclaim persisted for long periods of time, the same highmem pages get continually scanned. The idea would be that lowmem keeps those pages on a separate list until a reclaim for highmem pages arrives that splices the highmem pages back onto the LRU. It potentially could be implemented similar to the UNEVICTABLE list. That would reduce the skip rate with the potential corner case is that highmem pages have to be scanned and reclaimed to free lowmem slab pages. 2. Linear scan lowmem pages if the initial LRU shrink fails This will break LRU ordering but may be preferable and faster during memory pressure than skipping LRU pages. Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-07-28 16:07:41 -07:00
Mel Gorman	a52633d8e9	mm, vmscan: move lru_lock to the node Node-based reclaim requires node-based LRUs and locking. This is a preparation patch that just moves the lru_lock to the node so later patches are easier to review. It is a mechanical change but note this patch makes contention worse because the LRU lock is hotter and direct reclaim and kswapd can contend on the same lock even when reclaiming from different zones. Link: http://lkml.kernel.org/r/1467970510-21195-3-git-send-email-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Minchan Kim <minchan@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-07-28 16:07:41 -07:00
Ganesh Mahendran	b2b331f966	mm/compaction: remove unnecessary order check in try_to_compact_pages() The caller __alloc_pages_direct_compact() already checked (order == 0) so there's no need to check again. Link: http://lkml.kernel.org/r/1465973568-3496-1-git-send-email-opensource.ganesh@gmail.com Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-07-28 16:07:41 -07:00

1 2 3 4 5

238 Commits