122 Commits

Author SHA1 Message Date
Arian
3a330c6445 Merge branch 'android-4.14-stable' of https://android.googlesource.com/kernel/common into HEAD
Change-Id: I714223aa1f97959bd97b6bf758511466c9394bd8
2022-03-16 00:46:24 +01:00
Greg Kroah-Hartman
c1013a481e This is the 4.14.200 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl91udkACgkQONu9yGCS
 aT5wdw//Wn7Nd1kcd1h2umXWcn2rjLbrflmFMtAY32YSDhT85EQaTQMckeLzVKSi
 KPIPdNbjKhOirlizM6+mBv3EKn3bNbdDFeq0qGy0WI7DAT2Amyn8JRi18GKgM1z/
 BBzmOldG2oeFUlFes5ADdVm0b3p+iiQ5WqzIJvLkBeV1HrLAQ2/zAl6ixpUwP2wr
 sZymWkCXUUG2ZYn7ZxX6SXKvfvSv+GKUD0ImX82Q1DvAcQHSyiZ9/o9/pZbS4m1t
 innS1EXJEEZHd1x61EBu2WrQkNwXvlpTDZNa7at9Lc5Uys6NsgfUkx4+fTyqDT8h
 FfnxGaP9qHJRImVpxdBHsZGvUheDTVPPUk3eNNyeR1wmEz/OPCQmPIPIz0sw1TJt
 LL0d4XN1rLjRSUGv+q1+Y6nbc85eU7nyJG/qB35Fag4z27x3P7D/4queR1hAij37
 xp6MesrmEHPA7I4xV9jTzCRvY+O5MAughhOJMzZrn2f95EpC3Mg48OHdYRtktc1b
 4L9Vb/HeQ6s0hgVZhkdNfoBPsi797YNCd5WNccKpAdq17mG5s6+6qJxBgCqgHS/i
 4gjeR2IgSOL0KW2eE7STn8fgyY86ZepNsONkX6I+7n1h0lC7gVH6VtW5fzyvg9Bw
 RLA5ulOqUc+f8Ll7MwzbRL3+wiNN4jkX5NHYpaQB2nWVEO9A+cs=
 =76oN
 -----END PGP SIGNATURE-----

Merge 4.14.200 into android-4.14-stable

Changes in 4.14.200
	af_key: pfkey_dump needs parameter validation
	phy: qcom-qmp: Use correct values for ipq8074 PCIe Gen2 PHY init
	KVM: fix memory leak in kvm_io_bus_unregister_dev()
	kprobes: fix kill kprobe which has been marked as gone
	mm/thp: fix __split_huge_pmd_locked() for migration PMD
	RDMA/ucma: ucma_context reference leak in error path
	hdlc_ppp: add range checks in ppp_cp_parse_cr()
	ip: fix tos reflection in ack and reset packets
	net: ipv6: fix kconfig dependency warning for IPV6_SEG6_HMAC
	tipc: fix shutdown() of connection oriented socket
	tipc: use skb_unshare() instead in tipc_buf_append()
	bnxt_en: Protect bnxt_set_eee() and bnxt_set_pauseparam() with mutex.
	net: phy: Avoid NPD upon phy_detach() when driver is unbound
	net: add __must_check to skb_put_padto()
	ipv4: Update exception handling for multipath routes via same device
	geneve: add transport ports in route lookup for geneve
	serial: 8250: Avoid error message on reprobe
	mm: fix double page fault on arm64 if PTE_AF is cleared
	scsi: aacraid: fix illegal IO beyond last LBA
	m68k: q40: Fix info-leak in rtc_ioctl
	gma/gma500: fix a memory disclosure bug due to uninitialized bytes
	ASoC: kirkwood: fix IRQ error handling
	media: smiapp: Fix error handling at NVM reading
	arch/x86/lib/usercopy_64.c: fix __copy_user_flushcache() cache writeback
	x86/ioapic: Unbreak check_timer()
	ALSA: usb-audio: Add delay quirk for H570e USB headsets
	ALSA: hda/realtek - Couldn't detect Mic if booting with headset plugged
	PM / devfreq: tegra30: Fix integer overflow on CPU's freq max out
	scsi: fnic: fix use after free
	clk/ti/adpll: allocate room for terminating null
	mtd: cfi_cmdset_0002: don't free cfi->cfiq in error path of cfi_amdstd_setup()
	mfd: mfd-core: Protect against NULL call-back function pointer
	tracing: Adding NULL checks for trace_array descriptor pointer
	bcache: fix a lost wake-up problem caused by mca_cannibalize_lock
	RDMA/i40iw: Fix potential use after free
	xfs: fix attr leaf header freemap.size underflow
	RDMA/iw_cgxb4: Fix an error handling path in 'c4iw_connect()'
	mmc: core: Fix size overflow for mmc partitions
	gfs2: clean up iopen glock mess in gfs2_create_inode
	debugfs: Fix !DEBUG_FS debugfs_create_automount
	CIFS: Properly process SMB3 lease breaks
	kernel/sys.c: avoid copying possible padding bytes in copy_to_user
	neigh_stat_seq_next() should increase position index
	rt_cpu_seq_next should increase position index
	seqlock: Require WRITE_ONCE surrounding raw_seqcount_barrier
	media: ti-vpe: cal: Restrict DMA to avoid memory corruption
	ACPI: EC: Reference count query handlers under lock
	dmaengine: zynqmp_dma: fix burst length configuration
	powerpc/eeh: Only dump stack once if an MMIO loop is detected
	tracing: Set kernel_stack's caller size properly
	ar5523: Add USB ID of SMCWUSBT-G2 wireless adapter
	selftests/ftrace: fix glob selftest
	tools/power/x86/intel_pstate_tracer: changes for python 3 compatibility
	Bluetooth: Fix refcount use-after-free issue
	mm: pagewalk: fix termination condition in walk_pte_range()
	Bluetooth: prefetch channel before killing sock
	KVM: fix overflow of zero page refcount with ksm running
	ALSA: hda: Clear RIRB status before reading WP
	skbuff: fix a data race in skb_queue_len()
	audit: CONFIG_CHANGE don't log internal bookkeeping as an event
	selinux: sel_avc_get_stat_idx should increase position index
	scsi: lpfc: Fix RQ buffer leakage when no IOCBs available
	scsi: lpfc: Fix coverity errors in fmdi attribute handling
	drm/omap: fix possible object reference leak
	perf test: Fix test trace+probe_vfs_getname.sh on s390
	RDMA/rxe: Fix configuration of atomic queue pair attributes
	KVM: x86: fix incorrect comparison in trace event
	media: staging/imx: Missing assignment in imx_media_capture_device_register()
	x86/pkeys: Add check for pkey "overflow"
	bpf: Remove recursion prevention from rcu free callback
	dmaengine: tegra-apb: Prevent race conditions on channel's freeing
	media: go7007: Fix URB type for interrupt handling
	Bluetooth: guard against controllers sending zero'd events
	timekeeping: Prevent 32bit truncation in scale64_check_overflow()
	ext4: fix a data race at inode->i_disksize
	mm: avoid data corruption on CoW fault into PFN-mapped VMA
	drm/amdgpu: increase atombios cmd timeout
	ath10k: use kzalloc to read for ath10k_sdio_hif_diag_read
	scsi: aacraid: Disabling TM path and only processing IOP reset
	Bluetooth: L2CAP: handle l2cap config request during open state
	media: tda10071: fix unsigned sign extension overflow
	xfs: don't ever return a stale pointer from __xfs_dir3_free_read
	tpm: ibmvtpm: Wait for buffer to be set before proceeding
	rtc: ds1374: fix possible race condition
	tracing: Use address-of operator on section symbols
	serial: 8250_port: Don't service RX FIFO if throttled
	serial: 8250_omap: Fix sleeping function called from invalid context during probe
	serial: 8250: 8250_omap: Terminate DMA before pushing data on RX timeout
	perf cpumap: Fix snprintf overflow check
	cpufreq: powernv: Fix frame-size-overflow in powernv_cpufreq_work_fn
	tools: gpio-hammer: Avoid potential overflow in main
	RDMA/rxe: Set sys_image_guid to be aligned with HW IB devices
	SUNRPC: Fix a potential buffer overflow in 'svc_print_xprts()'
	svcrdma: Fix leak of transport addresses
	ubifs: Fix out-of-bounds memory access caused by abnormal value of node_len
	ALSA: usb-audio: Fix case when USB MIDI interface has more than one extra endpoint descriptor
	NFS: Fix races nfs_page_group_destroy() vs nfs_destroy_unlinked_subrequests()
	mm/kmemleak.c: use address-of operator on section symbols
	mm/filemap.c: clear page error before actual read
	mm/vmscan.c: fix data races using kswapd_classzone_idx
	mm/mmap.c: initialize align_offset explicitly for vm_unmapped_area
	scsi: qedi: Fix termination timeouts in session logout
	serial: uartps: Wait for tx_empty in console setup
	KVM: Remove CREATE_IRQCHIP/SET_PIT2 race
	bdev: Reduce time holding bd_mutex in sync in blkdev_close()
	drivers: char: tlclk.c: Avoid data race between init and interrupt handler
	staging:r8188eu: avoid skb_clone for amsdu to msdu conversion
	sparc64: vcc: Fix error return code in vcc_probe()
	arm64: cpufeature: Relax checks for AArch32 support at EL[0-2]
	dt-bindings: sound: wm8994: Correct required supplies based on actual implementaion
	atm: fix a memory leak of vcc->user_back
	power: supply: max17040: Correct voltage reading
	phy: samsung: s5pv210-usb2: Add delay after reset
	Bluetooth: Handle Inquiry Cancel error after Inquiry Complete
	USB: EHCI: ehci-mv: fix error handling in mv_ehci_probe()
	tty: serial: samsung: Correct clock selection logic
	ALSA: hda: Fix potential race in unsol event handler
	powerpc/traps: Make unrecoverable NMIs die instead of panic
	fuse: don't check refcount after stealing page
	USB: EHCI: ehci-mv: fix less than zero comparison of an unsigned int
	arm64/cpufeature: Drop TraceFilt feature exposure from ID_DFR0 register
	e1000: Do not perform reset in reset_task if we are already down
	drm/nouveau/debugfs: fix runtime pm imbalance on error
	printk: handle blank console arguments passed in.
	usb: dwc3: Increase timeout for CmdAct cleared by device controller
	btrfs: don't force read-only after error in drop snapshot
	vfio/pci: fix memory leaks of eventfd ctx
	perf util: Fix memory leak of prefix_if_not_in
	perf kcore_copy: Fix module map when there are no modules loaded
	mtd: rawnand: omap_elm: Fix runtime PM imbalance on error
	ceph: fix potential race in ceph_check_caps
	mm/swap_state: fix a data race in swapin_nr_pages
	rapidio: avoid data race between file operation callbacks and mport_cdev_add().
	mtd: parser: cmdline: Support MTD names containing one or more colons
	x86/speculation/mds: Mark mds_user_clear_cpu_buffers() __always_inline
	vfio/pci: Clear error and request eventfd ctx after releasing
	cifs: Fix double add page to memcg when cifs_readpages
	scsi: libfc: Handling of extra kref
	scsi: libfc: Skip additional kref updating work event
	selftests/x86/syscall_nt: Clear weird flags after each test
	vfio/pci: fix racy on error and request eventfd ctx
	btrfs: qgroup: fix data leak caused by race between writeback and truncate
	s390/init: add missing __init annotations
	i2c: core: Call i2c_acpi_install_space_handler() before i2c_acpi_register_devices()
	objtool: Fix noreturn detection for ignored functions
	ieee802154: fix one possible memleak in ca8210_dev_com_init
	ieee802154/adf7242: check status of adf7242_read_reg
	clocksource/drivers/h8300_timer8: Fix wrong return value in h8300_8timer_init()
	mwifiex: Increase AES key storage size to 256 bits
	batman-adv: bla: fix type misuse for backbone_gw hash indexing
	atm: eni: fix the missed pci_disable_device() for eni_init_one()
	batman-adv: mcast/TT: fix wrongly dropped or rerouted packets
	mac802154: tx: fix use-after-free
	drm/vc4/vc4_hdmi: fill ASoC card owner
	net: qed: RDMA personality shouldn't fail VF load
	batman-adv: Add missing include for in_interrupt()
	batman-adv: mcast: fix duplicate mcast packets in BLA backbone from mesh
	ALSA: asihpi: fix iounmap in error handler
	MIPS: Add the missing 'CPU_1074K' into __get_cpu_type()
	s390/dasd: Fix zero write for FBA devices
	kprobes: Fix to check probe enabled before disarm_kprobe_ftrace()
	mm, THP, swap: fix allocating cluster for swapfile by mistake
	lib/string.c: implement stpcpy
	ata: define AC_ERR_OK
	ata: make qc_prep return ata_completion_errors
	ata: sata_mv, avoid trigerrable BUG_ON
	Linux 4.14.200

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I3d3049dca196c46cb6b2a66d60a5a6a3a099efbb
2020-10-01 17:59:29 +02:00
Qian Cai
6d9fdd1325 mm/swap_state: fix a data race in swapin_nr_pages
[ Upstream commit d6c1f098f2a7ba62627c9bc17cda28f534ef9e4a ]

"prev_offset" is a static variable in swapin_nr_pages() that can be
accessed concurrently with only mmap_sem held in read mode as noticed by
KCSAN,

 BUG: KCSAN: data-race in swap_cluster_readahead / swap_cluster_readahead

 write to 0xffffffff92763830 of 8 bytes by task 14795 on cpu 17:
  swap_cluster_readahead+0x2a6/0x5e0
  swapin_readahead+0x92/0x8dc
  do_swap_page+0x49b/0xf20
  __handle_mm_fault+0xcfb/0xd70
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x715
  page_fault+0x34/0x40

 1 lock held by (dnf)/14795:
  #0: ffff897bd2e98858 (&mm->mmap_sem#2){++++}-{3:3}, at: do_page_fault+0x143/0x715
  do_user_addr_fault at arch/x86/mm/fault.c:1405
  (inlined by) do_page_fault at arch/x86/mm/fault.c:1535
 irq event stamp: 83493
 count_memcg_event_mm+0x1a6/0x270
 count_memcg_event_mm+0x119/0x270
 __do_softirq+0x365/0x589
 irq_exit+0xa2/0xc0

 read to 0xffffffff92763830 of 8 bytes by task 1 on cpu 22:
  swap_cluster_readahead+0xfd/0x5e0
  swapin_readahead+0x92/0x8dc
  do_swap_page+0x49b/0xf20
  __handle_mm_fault+0xcfb/0xd70
  handle_mm_fault+0xfc/0x2f0
  do_page_fault+0x263/0x715
  page_fault+0x34/0x40

 1 lock held by systemd/1:
  #0: ffff897c38f14858 (&mm->mmap_sem#2){++++}-{3:3}, at: do_page_fault+0x143/0x715
 irq event stamp: 43530289
 count_memcg_event_mm+0x1a6/0x270
 count_memcg_event_mm+0x119/0x270
 __do_softirq+0x365/0x589
 irq_exit+0xa2/0xc0

Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Marco Elver <elver@google.com>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/r/20200402213748.2237-1-cai@lca.pw
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01 13:12:46 +02:00
Srinivasarao P
0190a01fb1 Merge android-4.14-stable.190 (d2d05bc) into msm-4.14
* refs/heads/tmp-d2d05bc:
  Linux 4.14.190
  ath9k: Fix regression with Atheros 9271
  ath9k: Fix general protection fault in ath9k_hif_usb_rx_cb
  parisc: Add atomic64_set_release() define to avoid CPU soft lockups
  io-mapping: indicate mapping failure
  mm/memcg: fix refcount error while moving and swapping
  Makefile: Fix GCC_TOOLCHAIN_DIR prefix for Clang cross compilation
  vt: Reject zero-sized screen buffer size.
  fbdev: Detect integer underflow at "struct fbcon_ops"->clear_margins.
  serial: 8250_mtk: Fix high-speed baud rates clamping
  serial: 8250: fix null-ptr-deref in serial8250_start_tx()
  staging: comedi: addi_apci_1564: check INSN_CONFIG_DIGITAL_TRIG shift
  staging: comedi: addi_apci_1500: check INSN_CONFIG_DIGITAL_TRIG shift
  staging: comedi: ni_6527: fix INSN_CONFIG_DIGITAL_TRIG support
  staging: comedi: addi_apci_1032: check INSN_CONFIG_DIGITAL_TRIG shift
  staging: wlan-ng: properly check endpoint types
  Revert "cifs: Fix the target file was deleted when rename failed."
  usb: xhci: Fix ASM2142/ASM3142 DMA addressing
  usb: xhci-mtk: fix the failure of bandwidth allocation
  binder: Don't use mmput() from shrinker function.
  x86: math-emu: Fix up 'cmp' insn for clang ias
  arm64: Use test_tsk_thread_flag() for checking TIF_SINGLESTEP
  usb: gadget: udc: gr_udc: fix memleak on error handling path in gr_ep_init()
  Input: synaptics - enable InterTouch for ThinkPad X1E 1st gen
  dmaengine: ioat setting ioat timeout as module parameter
  hwmon: (aspeed-pwm-tacho) Avoid possible buffer overflow
  regmap: dev_get_regmap_match(): fix string comparison
  spi: mediatek: use correct SPI_CFG2_REG MACRO
  Input: add `SW_MACHINE_COVER`
  dmaengine: tegra210-adma: Fix runtime PM imbalance on error
  HID: apple: Disable Fn-key key-re-mapping on clone keyboards
  HID: i2c-hid: add Mediacom FlexBook edge13 to descriptor override
  scripts/decode_stacktrace: strip basepath from all paths
  serial: exar: Fix GPIO configuration for Sealevel cards based on XR17V35X
  bonding: check return value of register_netdevice() in bond_newlink()
  i2c: rcar: always clear ICSAR to avoid side effects
  ipvs: fix the connection sync failed in some cases
  mlxsw: destroy workqueue when trap_register in mlxsw_emad_init
  bonding: check error value of register_netdevice() immediately
  net: smc91x: Fix possible memory leak in smc_drv_probe()
  drm: sun4i: hdmi: Fix inverted HPD result
  net: dp83640: fix SIOCSHWTSTAMP to update the struct with actual configuration
  ax88172a: fix ax88172a_unbind() failures
  hippi: Fix a size used in a 'pci_free_consistent()' in an error handling path
  bnxt_en: Fix race when modifying pause settings.
  btrfs: fix page leaks after failure to lock page for delalloc
  btrfs: fix mount failure caused by race with umount
  btrfs: fix double free on ulist after backref resolution failure
  ASoC: rt5670: Correct RT5670_LDO_SEL_MASK
  ALSA: info: Drop WARN_ON() from buffer NULL sanity check
  uprobes: Change handle_swbp() to send SIGTRAP with si_code=SI_KERNEL, to fix GDB regression
  IB/umem: fix reference count leak in ib_umem_odp_get()
  spi: spi-fsl-dspi: Exit the ISR with IRQ_NONE when it's not ours
  SUNRPC reverting d03727b248d0 ("NFSv4 fix CLOSE not waiting for direct IO compeletion")
  irqdomain/treewide: Keep firmware node unconditionally allocated
  drm/nouveau/i2c/g94-: increase NV_PMGR_DP_AUXCTL_TRANSACTREQ timeout
  net: sky2: initialize return of gm_phy_read
  drivers/net/wan/lapbether: Fixed the value of hard_header_len
  xtensa: update *pos in cpuinfo_op.next
  xtensa: fix __sync_fetch_and_{and,or}_4 declarations
  scsi: scsi_transport_spi: Fix function pointer check
  mac80211: allow rx of mesh eapol frames with default rx key
  pinctrl: amd: fix npins for uart0 in kerncz_groups
  gpio: arizona: put pm_runtime in case of failure
  gpio: arizona: handle pm_runtime_get_sync failure case
  ANDROID: Incremental fs: magic number compatible 32-bit
  ANDROID: kbuild: don't merge .*..compoundliteral in modules
  Revert "arm64/alternatives: use subsections for replacement sequences"
  Linux 4.14.189
  rxrpc: Fix trace string
  libceph: don't omit recovery_deletes in target_copy()
  x86/cpu: Move x86_cache_bits settings
  sched/fair: handle case of task_h_load() returning 0
  arm64: ptrace: Override SPSR.SS when single-stepping is enabled
  thermal/drivers/cpufreq_cooling: Fix wrong frequency converted from power
  misc: atmel-ssc: lock with mutex instead of spinlock
  dmaengine: fsl-edma: Fix NULL pointer exception in fsl_edma_tx_handler
  intel_th: pci: Add Emmitsburg PCH support
  intel_th: pci: Add Tiger Lake PCH-H support
  intel_th: pci: Add Jasper Lake CPU support
  hwmon: (emc2103) fix unable to change fan pwm1_enable attribute
  MIPS: Fix build for LTS kernel caused by backporting lpj adjustment
  timer: Fix wheel index calculation on last level
  uio_pdrv_genirq: fix use without device tree and no interrupt
  Input: i8042 - add Lenovo XiaoXin Air 12 to i8042 nomux list
  mei: bus: don't clean driver pointer
  Revert "zram: convert remaining CLASS_ATTR() to CLASS_ATTR_RO()"
  fuse: Fix parameter for FS_IOC_{GET,SET}FLAGS
  virtio: virtio_console: add missing MODULE_DEVICE_TABLE() for rproc serial
  USB: serial: option: add Quectel EG95 LTE modem
  USB: serial: option: add GosunCn GM500 series
  USB: serial: ch341: add new Product ID for CH340
  USB: serial: cypress_m8: enable Simply Automated UPB PIM
  USB: serial: iuu_phoenix: fix memory corruption
  usb: gadget: function: fix missing spinlock in f_uac1_legacy
  usb: chipidea: core: add wakeup support for extcon
  usb: dwc2: Fix shutdown callback in platform
  USB: c67x00: fix use after free in c67x00_giveback_urb
  ALSA: usb-audio: Fix race against the error recovery URB submission
  ALSA: line6: Perform sanity check for each URB creation
  HID: magicmouse: do not set up autorepeat
  mtd: rawnand: oxnas: Release all devices in the _remove() path
  mtd: rawnand: oxnas: Unregister all devices on error
  mtd: rawnand: oxnas: Keep track of registered devices
  mtd: rawnand: brcmnand: fix CS0 layout
  perf stat: Zero all the 'ena' and 'run' array slot stats for interval mode
  copy_xstate_to_kernel: Fix typo which caused GDB regression
  ARM: dts: socfpga: Align L2 cache-controller nodename with dtschema
  Revert "thermal: mediatek: fix register index error"
  staging: comedi: verify array index is correct before using it
  usb: gadget: udc: atmel: fix uninitialized read in debug printk
  spi: spi-sun6i: sun6i_spi_transfer_one(): fix setting of clock rate
  arm64: dts: meson: add missing gxl rng clock
  phy: sun4i-usb: fix dereference of pointer phy0 before it is null checked
  iio:health:afe4404 Fix timestamp alignment and prevent data leak.
  ACPI: video: Use native backlight on Acer TravelMate 5735Z
  ACPI: video: Use native backlight on Acer Aspire 5783z
  mmc: sdhci: do not enable card detect interrupt for gpio cd type
  doc: dt: bindings: usb: dwc3: Update entries for disabling SS instances in park mode
  Revert "usb/xhci-plat: Set PM runtime as active on resume"
  Revert "usb/ehci-platform: Set PM runtime as active on resume"
  Revert "usb/ohci-platform: Fix a warning when hibernating"
  of: of_mdio: Correct loop scanning logic
  net: dsa: bcm_sf2: Fix node reference count
  spi: fix initial SPI_SR value in spi-fsl-dspi
  spi: spi-fsl-dspi: Fix lockup if device is shutdown during SPI transfer
  iio:health:afe4403 Fix timestamp alignment and prevent data leak.
  iio:pressure:ms5611 Fix buffer element alignment
  iio: pressure: zpa2326: handle pm_runtime_get_sync failure
  iio: mma8452: Add missed iio_device_unregister() call in mma8452_probe()
  iio: magnetometer: ak8974: Fix runtime PM imbalance on error
  iio:humidity:hdc100x Fix alignment and data leak issues
  iio:magnetometer:ak8974: Fix alignment and data leak issues
  arm64/alternatives: don't patch up internal branches
  arm64: alternative: Use true and false for boolean values
  i2c: eg20t: Load module automatically if ID matches
  gfs2: read-only mounts should grab the sd_freeze_gl glock
  tpm_tis: extra chip->ops check on error path in tpm_tis_core_init
  arm64/alternatives: use subsections for replacement sequences
  drm/exynos: fix ref count leak in mic_pre_enable
  cgroup: Fix sock_cgroup_data on big-endian.
  cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
  tcp: md5: do not send silly options in SYNCOOKIES
  tcp: make sure listeners don't initialize congestion-control state
  net_sched: fix a memory leak in atm_tc_init()
  tcp: md5: allow changing MD5 keys in all socket states
  tcp: md5: refine tcp_md5_do_add()/tcp_md5_hash_key() barriers
  tcp: md5: add missing memory barriers in tcp_md5_do_add()/tcp_md5_hash_key()
  net: usb: qmi_wwan: add support for Quectel EG95 LTE modem
  net: Added pointer check for dst->ops->neigh_lookup in dst_neigh_lookup_skb
  llc: make sure applications use ARPHRD_ETHER
  l2tp: remove skb_dst_set() from l2tp_xmit_skb()
  ipv4: fill fl4_icmp_{type,code} in ping_v4_sendmsg
  genetlink: remove genl_bind
  s390/mm: fix huge pte soft dirty copying
  ARC: elf: use right ELF_ARCH
  ARC: entry: fix potential EFA clobber when TIF_SYSCALL_TRACE
  dm: use noio when sending kobject event
  drm/radeon: fix double free
  btrfs: fix fatal extent_buffer readahead vs releasepage race
  Revert "ath9k: Fix general protection fault in ath9k_hif_usb_rx_cb"
  KVM: x86: Mark CR4.TSD as being possibly owned by the guest
  KVM: x86: Inject #GP if guest attempts to toggle CR4.LA57 in 64-bit mode
  KVM: x86: bit 8 of non-leaf PDPEs is not reserved
  KVM: arm64: Stop clobbering x0 for HVC_SOFT_RESTART
  KVM: arm64: Fix definition of PAGE_HYP_DEVICE
  ALSA: usb-audio: add quirk for MacroSilicon MS2109
  ALSA: hda - let hs_mic be picked ahead of hp_mic
  ALSA: opl3: fix infoleak in opl3
  mlxsw: spectrum_router: Remove inappropriate usage of WARN_ON()
  net: macb: mark device wake capable when "magic-packet" property present
  bnxt_en: fix NULL dereference in case SR-IOV configuration fails
  nbd: Fix memory leak in nbd_add_socket
  arm64: kgdb: Fix single-step exception handling oops
  ALSA: compress: fix partial_drain completion state
  smsc95xx: avoid memory leak in smsc95xx_bind
  smsc95xx: check return value of smsc95xx_reset
  net: cxgb4: fix return error value in t4_prep_fw
  x86/entry: Increase entry_stack size to a full page
  nvme-rdma: assign completion vector correctly
  scsi: mptscsih: Fix read sense data size
  ARM: imx6: add missing put_device() call in imx6q_suspend_init()
  cifs: update ctime and mtime during truncate
  s390/kasan: fix early pgm check handler execution
  ixgbe: protect ring accesses with READ- and WRITE_ONCE
  spi: spidev: fix a potential use-after-free in spidev_release()
  spi: spidev: fix a race between spidev_release and spidev_remove
  gpu: host1x: Detach driver on unregister
  ARM: dts: omap4-droid4: Fix spi configuration and increase rate
  spi: spi-fsl-dspi: Fix external abort on interrupt in resume or exit paths
  spi: spi-fsl-dspi: use IRQF_SHARED mode to request IRQ
  spi: spi-fsl-dspi: Fix lockup if device is removed during SPI transfer
  spi: spi-fsl-dspi: Adding shutdown hook
  KVM: s390: reduce number of IO pins to 1
  UPSTREAM: perf/core: Fix crash when using HW tracing kernel filters
  ANDROID: fscrypt: fix DUN contiguity with inline encryption + IV_INO_LBLK_32 policies
  ANDROID: f2fs: add back compress inode check
  Linux 4.14.188
  efi: Make it possible to disable efivar_ssdt entirely
  dm zoned: assign max_io_len correctly
  irqchip/gic: Atomically update affinity
  MIPS: Add missing EHB in mtc0 -> mfc0 sequence for DSPen
  cifs: Fix the target file was deleted when rename failed.
  SMB3: Honor persistent/resilient handle flags for multiuser mounts
  SMB3: Honor 'seal' flag for multiuser mounts
  Revert "ALSA: usb-audio: Improve frames size computation"
  nfsd: apply umask on fs without ACL support
  i2c: algo-pca: Add 0x78 as SCL stuck low status for PCA9665
  virtio-blk: free vblk-vqs in error path of virtblk_probe()
  drm: sun4i: hdmi: Remove extra HPD polling
  hwmon: (acpi_power_meter) Fix potential memory leak in acpi_power_meter_add()
  hwmon: (max6697) Make sure the OVERT mask is set correctly
  cxgb4: parse TC-U32 key values and masks natively
  cxgb4: use unaligned conversion for fetching timestamp
  crypto: af_alg - fix use-after-free in af_alg_accept() due to bh_lock_sock()
  kgdb: Avoid suspicious RCU usage warning
  usb: usbtest: fix missing kfree(dev->buf) in usbtest_disconnect
  mm/slub: fix stack overruns with SLUB_STATS
  mm/slub.c: fix corrupted freechain in deactivate_slab()
  usbnet: smsc95xx: Fix use-after-free after removal
  EDAC/amd64: Read back the scrub rate PCI register on F15h
  mm: fix swap cache node allocation mask
  btrfs: fix data block group relocation failure due to concurrent scrub
  btrfs: cow_file_range() num_bytes and disk_num_bytes are same
  btrfs: fix a block group ref counter leak after failure to remove block group
  UPSTREAM: binder: fix null deref of proc->context
  ANDROID: GKI: scripts: Makefile: update the lz4 command (#2)
  Linux 4.14.187
  Revert "tty: hvc: Fix data abort due to race in hvc_open"
  xfs: add agf freeblocks verify in xfs_agf_verify
  NFSv4 fix CLOSE not waiting for direct IO compeletion
  pNFS/flexfiles: Fix list corruption if the mirror count changes
  SUNRPC: Properly set the @subbuf parameter of xdr_buf_subsegment()
  sunrpc: fixed rollback in rpc_gssd_dummy_populate()
  Staging: rtl8723bs: prevent buffer overflow in update_sta_support_rate()
  drm/radeon: fix fb_div check in ni_init_smc_spll_table()
  tracing: Fix event trigger to accept redundant spaces
  arm64: perf: Report the PC value in REGS_ABI_32 mode
  ocfs2: fix panic on nfs server over ocfs2
  ocfs2: fix value of OCFS2_INVALID_SLOT
  ocfs2: load global_inode_alloc
  mm/slab: use memzero_explicit() in kzfree()
  btrfs: fix failure of RWF_NOWAIT write into prealloc extent beyond eof
  KVM: nVMX: Plumb L2 GPA through to PML emulation
  KVM: X86: Fix MSR range of APIC registers in X2APIC mode
  ACPI: sysfs: Fix pm_profile_attr type
  ALSA: hda: Add NVIDIA codec IDs 9a & 9d through a0 to patch table
  blktrace: break out of blktrace setup on concurrent calls
  kbuild: improve cc-option to clean up all temporary files
  s390/ptrace: fix setting syscall number
  net: alx: fix race condition in alx_remove
  ata/libata: Fix usage of page address by page_address in ata_scsi_mode_select_xlat function
  sched/core: Fix PI boosting between RT and DEADLINE tasks
  net: bcmgenet: use hardware padding of runt frames
  netfilter: ipset: fix unaligned atomic access
  usb: gadget: udc: Potential Oops in error handling code
  ARM: imx5: add missing put_device() call in imx_suspend_alloc_ocram()
  net: qed: fix excessive QM ILT lines consumption
  net: qed: fix NVMe login fails over VFs
  net: qed: fix left elements count calculation
  RDMA/mad: Fix possible memory leak in ib_mad_post_receive_mads()
  ASoC: rockchip: Fix a reference count leak.
  RDMA/cma: Protect bind_list and listen_list while finding matching cm id
  rxrpc: Fix handling of rwind from an ACK packet
  ARM: dts: NSP: Correct FA2 mailbox node
  efi/esrt: Fix reference count leak in esre_create_sysfs_entry.
  cifs/smb3: Fix data inconsistent when zero file range
  cifs/smb3: Fix data inconsistent when punch hole
  xhci: Poll for U0 after disabling USB2 LPM
  ALSA: usb-audio: Fix OOB access of mixer element list
  ALSA: usb-audio: Clean up mixer element list traverse
  ALSA: usb-audio: uac1: Invalidate ctl on interrupt
  loop: replace kill_bdev with invalidate_bdev
  cdc-acm: Add DISABLE_ECHO quirk for Microchip/SMSC chip
  xhci: Fix enumeration issue when setting max packet size for FS devices.
  xhci: Fix incorrect EP_STATE_MASK
  ALSA: usb-audio: add quirk for Denon DCD-1500RE
  usb: host: ehci-exynos: Fix error check in exynos_ehci_probe()
  usb: host: xhci-mtk: avoid runtime suspend when removing hcd
  USB: ehci: reopen solution for Synopsys HC bug
  usb: add USB_QUIRK_DELAY_INIT for Logitech C922
  usb: dwc2: Postponed gadget registration to the udc class driver
  USB: ohci-sm501: Add missed iounmap() in remove
  net: core: reduce recursion limit value
  net: Do not clear the sock TX queue in sk_set_socket()
  net: Fix the arp error in some cases
  ip6_gre: fix use-after-free in ip6gre_tunnel_lookup()
  tcp_cubic: fix spurious HYSTART_DELAY exit upon drop in min RTT
  ip_tunnel: fix use-after-free in ip_tunnel_lookup()
  tg3: driver sleeps indefinitely when EEH errors exceed eeh_max_freezes
  tcp: grow window for OOO packets only for SACK flows
  sctp: Don't advertise IPv4 addresses if ipv6only is set on the socket
  rxrpc: Fix notification call on completion of discarded calls
  rocker: fix incorrect error handling in dma_rings_init
  net: usb: ax88179_178a: fix packet alignment padding
  net: fix memleak in register_netdevice()
  net: bridge: enfore alignment for ethernet address
  mld: fix memory leak in ipv6_mc_destroy_dev()
  ibmveth: Fix max MTU limit
  apparmor: don't try to replace stale label in ptraceme check
  fix a braino in "sparc32: fix register window handling in genregs32_[gs]et()"
  net: sched: export __netdev_watchdog_up()
  block/bio-integrity: don't free 'buf' if bio_integrity_add_page() failed
  net: be more gentle about silly gso requests coming from user
  scsi: scsi_devinfo: handle non-terminated strings
  ANDROID: Makefile: append BUILD_NUMBER to version string when defined
  Linux 4.14.186
  KVM: x86/mmu: Set mmio_value to '0' if reserved #PF can't be generated
  kvm: x86: Fix reserved bits related calculation errors caused by MKTME
  kvm: x86: Move kvm_set_mmio_spte_mask() from x86.c to mmu.c
  md: add feature flag MD_FEATURE_RAID0_LAYOUT
  net: core: device_rename: Use rwsem instead of a seqcount
  sched/rt, net: Use CONFIG_PREEMPTION.patch
  kretprobe: Prevent triggering kretprobe from within kprobe_flush_task
  e1000e: Do not wake up the system via WOL if device wakeup is disabled
  kprobes: Fix to protect kick_kprobe_optimizer() by kprobe_mutex
  crypto: algboss - don't wait during notifier callback
  crypto: algif_skcipher - Cap recv SG list at ctx->used
  mtd: rawnand: tmio: Fix the probe error path
  mtd: rawnand: mtk: Fix the probe error path
  mtd: rawnand: plat_nand: Fix the probe error path
  mtd: rawnand: socrates: Fix the probe error path
  mtd: rawnand: oxnas: Fix the probe error path
  mtd: rawnand: oxnas: Add of_node_put()
  mtd: rawnand: orion: Fix the probe error path
  mtd: rawnand: xway: Fix the probe error path
  mtd: rawnand: sharpsl: Fix the probe error path
  mtd: rawnand: diskonchip: Fix the probe error path
  mtd: rawnand: Pass a nand_chip object to nand_release()
  block: nr_sects_write(): Disable preemption on seqcount write
  x86/boot/compressed: Relax sed symbol type regex for LLVM ld.lld
  drm/dp_mst: Increase ACT retry timeout to 3s
  ext4: fix partial cluster initialization when splitting extent
  selinux: fix double free
  drm/qxl: Use correct notify port address when creating cursor ring
  drm/dp_mst: Reformat drm_dp_check_act_status() a bit
  drm: encoder_slave: fix refcouting error for modules
  libata: Use per port sync for detach
  arm64: hw_breakpoint: Don't invoke overflow handler on uaccess watchpoints
  block: Fix use-after-free in blkdev_get()
  bcache: fix potential deadlock problem in btree_gc_coalesce
  perf report: Fix NULL pointer dereference in hists__fprintf_nr_sample_events()
  usb/ehci-platform: Set PM runtime as active on resume
  usb/xhci-plat: Set PM runtime as active on resume
  scsi: acornscsi: Fix an error handling path in acornscsi_probe()
  drm/sun4i: hdmi ddc clk: Fix size of m divider
  selftests/net: in timestamping, strncpy needs to preserve null byte
  gfs2: fix use-after-free on transaction ail lists
  blktrace: fix endianness for blk_log_remap()
  blktrace: fix endianness in get_pdu_int()
  blktrace: use errno instead of bi_status
  selftests/vm/pkeys: fix alloc_random_pkey() to make it really random
  elfnote: mark all .note sections SHF_ALLOC
  include/linux/bitops.h: avoid clang shift-count-overflow warnings
  lib/zlib: remove outdated and incorrect pre-increment optimization
  geneve: change from tx_error to tx_dropped on missing metadata
  crypto: omap-sham - add proper load balancing support for multicore
  pinctrl: freescale: imx: Fix an error handling path in 'imx_pinctrl_probe()'
  pinctrl: imxl: Fix an error handling path in 'imx1_pinctrl_core_probe()'
  scsi: ufs: Don't update urgent bkops level when toggling auto bkops
  scsi: iscsi: Fix reference count leak in iscsi_boot_create_kobj
  gfs2: Allow lock_nolock mount to specify jid=X
  openrisc: Fix issue with argument clobbering for clone/fork
  vfio/mdev: Fix reference count leak in add_mdev_supported_type
  ASoC: fsl_asrc_dma: Fix dma_chan leak when config DMA channel failed
  extcon: adc-jack: Fix an error handling path in 'adc_jack_probe()'
  powerpc/4xx: Don't unmap NULL mbase
  NFSv4.1 fix rpc_call_done assignment for BIND_CONN_TO_SESSION
  net: sunrpc: Fix off-by-one issues in 'rpc_ntop6'
  scsi: ufs-qcom: Fix scheduling while atomic issue
  clk: bcm2835: Fix return type of bcm2835_register_gate
  x86/apic: Make TSC deadline timer detection message visible
  usb: gadget: Fix issue with config_ep_by_speed function
  usb: gadget: fix potential double-free in m66592_probe.
  usb: gadget: lpc32xx_udc: don't dereference ep pointer before null check
  USB: gadget: udc: s3c2410_udc: Remove pointless NULL check in s3c2410_udc_nuke
  usb: dwc2: gadget: move gadget resume after the core is in L0 state
  watchdog: da9062: No need to ping manually before setting timeout
  IB/cma: Fix ports memory leak in cma_configfs
  PCI/PTM: Inherit Switch Downstream Port PTM settings from Upstream Port
  dm zoned: return NULL if dmz_get_zone_for_reclaim() fails to find a zone
  powerpc/64s/pgtable: fix an undefined behaviour
  clk: samsung: exynos5433: Add IGNORE_UNUSED flag to sclk_i2s1
  tty: n_gsm: Fix bogus i++ in gsm_data_kick
  USB: host: ehci-mxc: Add error handling in ehci_mxc_drv_probe()
  drm/msm/mdp5: Fix mdp5_init error path for failed mdp5_kms allocation
  usb/ohci-platform: Fix a warning when hibernating
  vfio-pci: Mask cap zero
  powerpc/ps3: Fix kexec shutdown hang
  powerpc/pseries/ras: Fix FWNMI_VALID off by one
  tty: n_gsm: Fix waking up upper tty layer when room available
  tty: n_gsm: Fix SOF skipping
  PCI: Fix pci_register_host_bridge() device_register() error handling
  clk: ti: composite: fix memory leak
  dlm: remove BUG() before panic()
  scsi: mpt3sas: Fix double free warnings
  power: supply: smb347-charger: IRQSTAT_D is volatile
  power: supply: lp8788: Fix an error handling path in 'lp8788_charger_probe()'
  scsi: qla2xxx: Fix warning after FC target reset
  PCI/ASPM: Allow ASPM on links to PCIe-to-PCI/PCI-X Bridges
  PCI: rcar: Fix incorrect programming of OB windows
  drivers: base: Fix NULL pointer exception in __platform_driver_probe() if a driver developer is foolish
  serial: amba-pl011: Make sure we initialize the port.lock spinlock
  i2c: pxa: fix i2c_pxa_scream_blue_murder() debug output
  staging: sm750fb: add missing case while setting FB_VISUAL
  thermal/drivers/ti-soc-thermal: Avoid dereferencing ERR_PTR
  tty: hvc: Fix data abort due to race in hvc_open
  s390/qdio: put thinint indicator after early error
  ALSA: usb-audio: Improve frames size computation
  scsi: qedi: Do not flush offload work if ARP not resolved
  staging: greybus: fix a missing-check bug in gb_lights_light_config()
  scsi: ibmvscsi: Don't send host info in adapter info MAD after LPM
  scsi: sr: Fix sr_probe() missing deallocate of device minor
  apparmor: fix introspection of of task mode for unconfined tasks
  mksysmap: Fix the mismatch of '.L' symbols in System.map
  NTB: Fix the default port and peer numbers for legacy drivers
  yam: fix possible memory leak in yam_init_driver
  powerpc/crashkernel: Take "mem=" option into account
  nfsd: Fix svc_xprt refcnt leak when setup callback client failed
  powerpc/perf/hv-24x7: Fix inconsistent output values incase multiple hv-24x7 events run
  clk: clk-flexgen: fix clock-critical handling
  scsi: lpfc: Fix lpfc_nodelist leak when processing unsolicited event
  mfd: wm8994: Fix driver operation if loaded as modules
  m68k/PCI: Fix a memory leak in an error handling path
  vfio/pci: fix memory leaks in alloc_perm_bits()
  ps3disk: use the default segment boundary
  PCI: aardvark: Don't blindly enable ASPM L0s and don't write to read-only register
  dm mpath: switch paths in dm_blk_ioctl() code path
  usblp: poison URBs upon disconnect
  i2c: pxa: clear all master action bits in i2c_pxa_stop_message()
  f2fs: report delalloc reserve as non-free in statfs for project quota
  iio: bmp280: fix compensation of humidity
  scsi: qla2xxx: Fix issue with adapter's stopping state
  ALSA: isa/wavefront: prevent out of bounds write in ioctl
  scsi: qedi: Check for buffer overflow in qedi_set_path()
  ARM: integrator: Add some Kconfig selections
  ASoC: davinci-mcasp: Fix dma_chan refcnt leak when getting dma type
  backlight: lp855x: Ensure regulators are disabled on probe failure
  clk: qcom: msm8916: Fix the address location of pll->config_reg
  remoteproc: Fix IDR initialisation in rproc_alloc()
  iio: pressure: bmp280: Tolerate IRQ before registering
  i2c: piix4: Detect secondary SMBus controller on AMD AM4 chipsets
  clk: sunxi: Fix incorrect usage of round_down()
  power: supply: bq24257_charger: Replace depends on REGMAP_I2C with select
  drm/i915: Whitelist context-local timestamp in the gen9 cmdparser
  s390: fix syscall_get_error for compat processes
  ANDROID: ext4: Optimize match for casefolded encrypted dirs
  ANDROID: ext4: Handle casefolding with encryption
  ANDROID: cuttlefish_defconfig: x86: Enable KERNEL_LZ4
  ANDROID: GKI: scripts: Makefile: update the lz4 command
  FROMLIST: f2fs: fix use-after-free when accessing bio->bi_crypt_context
  Linux 4.14.185
  perf symbols: Fix debuginfo search for Ubuntu
  perf probe: Fix to check blacklist address correctly
  perf probe: Do not show the skipped events
  w1: omap-hdq: cleanup to add missing newline for some dev_dbg
  mtd: rawnand: pasemi: Fix the probe error path
  mtd: rawnand: brcmnand: fix hamming oob layout
  sunrpc: clean up properly in gss_mech_unregister()
  sunrpc: svcauth_gss_register_pseudoflavor must reject duplicate registrations.
  kbuild: force to build vmlinux if CONFIG_MODVERSION=y
  powerpc/64s: Save FSCR to init_task.thread.fscr after feature init
  powerpc/64s: Don't let DT CPU features set FSCR_DSCR
  drivers/macintosh: Fix memleak in windfarm_pm112 driver
  ARM: tegra: Correct PL310 Auxiliary Control Register initialization
  kernel/cpu_pm: Fix uninitted local in cpu_pm
  dm crypt: avoid truncating the logical block size
  sparc64: fix misuses of access_process_vm() in genregs32_[sg]et()
  sparc32: fix register window handling in genregs32_[gs]et()
  pinctrl: samsung: Save/restore eint_mask over suspend for EINT_TYPE GPIOs
  power: vexpress: add suppress_bind_attrs to true
  igb: Report speed and duplex as unknown when device is runtime suspended
  media: ov5640: fix use of destroyed mutex
  b43_legacy: Fix connection problem with WPA3
  b43: Fix connection problem with WPA3
  b43legacy: Fix case where channel status is corrupted
  media: go7007: fix a miss of snd_card_free
  carl9170: remove P2P_GO support
  e1000e: Relax condition to trigger reset for ME workaround
  e1000e: Disable TSO for buffer overrun workaround
  PCI: Program MPS for RCiEP devices
  blk-mq: move _blk_mq_update_nr_hw_queues synchronize_rcu call
  btrfs: fix wrong file range cleanup after an error filling dealloc range
  btrfs: fix error handling when submitting direct I/O bio
  PCI: Unify ACS quirk desired vs provided checking
  PCI: Add ACS quirk for Intel Root Complex Integrated Endpoints
  PCI: Generalize multi-function power dependency device links
  vga_switcheroo: Use device link for HDA controller
  vga_switcheroo: Deduplicate power state tracking
  PCI: Make ACS quirk implementations more uniform
  PCI: Add ACS quirk for Ampere root ports
  PCI: Add ACS quirk for iProc PAXB
  PCI: Avoid FLR for AMD Starship USB 3.0
  PCI: Avoid FLR for AMD Matisse HD Audio & USB 3.0
  PCI: Disable MSI for Freescale Layerscape PCIe RC mode
  ext4: fix race between ext4_sync_parent() and rename()
  ext4: fix error pointer dereference
  ext4: fix EXT_MAX_EXTENT/INDEX to check for zeroed eh_max
  evm: Fix possible memory leak in evm_calc_hmac_or_hash()
  ima: Directly assign the ima_default_policy pointer to ima_rules
  ima: Fix ima digest hash table key calculation
  mm: thp: make the THP mapcount atomic against __split_huge_pmd_locked()
  btrfs: send: emit file capabilities after chown
  string.h: fix incompatibility between FORTIFY_SOURCE and KASAN
  platform/x86: hp-wmi: Convert simple_strtoul() to kstrtou32()
  cpuidle: Fix three reference count leaks
  spi: dw: Return any value retrieved from the dma_transfer callback
  mmc: sdhci-esdhc-imx: fix the mask for tuning start point
  ixgbe: fix signed-integer-overflow warning
  mmc: via-sdmmc: Respect the cmd->busy_timeout from the mmc core
  staging: greybus: sdio: Respect the cmd->busy_timeout from the mmc core
  mmc: sdhci-msm: Set SDHCI_QUIRK_MULTIBLOCK_READ_ACMD12 quirk
  MIPS: Fix IRQ tracing when call handle_fpe() and handle_msa_fpe()
  PCI: Don't disable decoding when mmio_always_on is set
  macvlan: Skip loopback packets in RX handler
  m68k: mac: Don't call via_flush_cache() on Mac IIfx
  x86/mm: Stop printing BRK addresses
  mips: Add udelay lpj numbers adjustment
  mips: MAAR: Use more precise address mask
  x86/boot: Correct relocation destination on old linkers
  mwifiex: Fix memory corruption in dump_station
  rtlwifi: Fix a double free in _rtl_usb_tx_urb_setup()
  md: don't flush workqueue unconditionally in md_open
  net: qed*: Reduce RX and TX default ring count when running inside kdump kernel
  wcn36xx: Fix error handling path in 'wcn36xx_probe()'
  nvme: refine the Qemu Identify CNS quirk
  kgdb: Fix spurious true from in_dbg_master()
  mips: cm: Fix an invalid error code of INTVN_*_ERR
  MIPS: Truncate link address into 32bit for 32bit kernel
  Crypto/chcr: fix for ccm(aes) failed test
  powerpc/spufs: fix copy_to_user while atomic
  net: allwinner: Fix use correct return type for ndo_start_xmit()
  media: cec: silence shift wrapping warning in __cec_s_log_addrs()
  net: lpc-enet: fix error return code in lpc_mii_init()
  exit: Move preemption fixup up, move blocking operations down
  lib/mpi: Fix 64-bit MIPS build with Clang
  net: bcmgenet: set Rx mode before starting netif
  netfilter: nft_nat: return EOPNOTSUPP if type or flags are not supported
  audit: fix a net reference leak in audit_list_rules_send()
  MIPS: Make sparse_init() using top-down allocation
  media: platform: fcp: Set appropriate DMA parameters
  media: dvb: return -EREMOTEIO on i2c transfer failure.
  audit: fix a net reference leak in audit_send_reply()
  dt-bindings: display: mediatek: control dpi pins mode to avoid leakage
  e1000: Distribute switch variables for initialization
  tools api fs: Make xxx__mountpoint() more scalable
  brcmfmac: fix wrong location to get firmware feature
  staging: android: ion: use vmap instead of vm_map_ram
  net: vmxnet3: fix possible buffer overflow caused by bad DMA value in vmxnet3_get_rss()
  x86/kvm/hyper-v: Explicitly align hcall param for kvm_hyperv_exit
  spi: dw: Fix Rx-only DMA transfers
  ARM: 8978/1: mm: make act_mm() respect THREAD_SIZE
  btrfs: do not ignore error from btrfs_next_leaf() when inserting checksums
  clocksource: dw_apb_timer_of: Fix missing clockevent timers
  clocksource: dw_apb_timer: Make CPU-affiliation being optional
  spi: dw: Enable interrupts in accordance with DMA xfer mode
  kgdb: Prevent infinite recursive entries to the debugger
  Bluetooth: Add SCO fallback for invalid LMP parameters error
  MIPS: Loongson: Build ATI Radeon GPU driver as module
  ixgbe: Fix XDP redirect on archs with PAGE_SIZE above 4K
  spi: dw: Zero DMA Tx and Rx configurations on stack
  net: ena: fix error returning in ena_com_get_hash_function()
  spi: pxa2xx: Apply CS clk quirk to BXT
  objtool: Ignore empty alternatives
  media: si2157: Better check for running tuner in init
  crypto: ccp -- don't "select" CONFIG_DMADEVICES
  drm: bridge: adv7511: Extend list of audio sample rates
  ACPI: GED: use correct trigger type field in _Exx / _Lxx handling
  xen/pvcalls-back: test for errors when calling backend_connect()
  can: kvaser_usb: kvaser_usb_leaf: Fix some info-leaks to USB devices
  mmc: sdio: Fix potential NULL pointer error in mmc_sdio_init_card()
  mmc: sdhci-msm: Clear tuning done flag while hs400 tuning
  agp/intel: Reinforce the barrier after GTT updates
  perf: Add cond_resched() to task_function_call()
  fat: don't allow to mount if the FAT length == 0
  mm/slub: fix a memory leak in sysfs_slab_add()
  Smack: slab-out-of-bounds in vsscanf
  ath9k: Fix general protection fault in ath9k_hif_usb_rx_cb
  ath9x: Fix stack-out-of-bounds Write in ath9k_hif_usb_rx_cb
  ath9k: Fix use-after-free Write in ath9k_htc_rx_msg
  ath9k: Fix use-after-free Read in ath9k_wmi_ctrl_rx
  KVM: arm64: Make vcpu_cp1x() work on Big Endian hosts
  KVM: MIPS: Fix VPN2_MASK definition for variable cpu_vmbits
  KVM: MIPS: Define KVM_ENTRYHI_ASID to cpu_asid_mask(&boot_cpu_data)
  KVM: nVMX: Consult only the "basic" exit reason when routing nested exit
  KVM: nSVM: leave ASID aside in copy_vmcb_control_area
  KVM: nSVM: fix condition for filtering async PF
  video: fbdev: w100fb: Fix a potential double free.
  proc: Use new_inode not new_inode_pseudo
  ovl: initialize error in ovl_copy_xattr
  selftests/net: in rxtimestamp getopt_long needs terminating null entry
  crypto: virtio: Fix dest length calculation in __virtio_crypto_skcipher_do_req()
  crypto: virtio: Fix src/dst scatterlist calculation in __virtio_crypto_skcipher_do_req()
  crypto: virtio: Fix use-after-free in virtio_crypto_skcipher_finalize_req()
  spi: bcm2835: Fix controller unregister order
  spi: pxa2xx: Fix controller unregister order
  spi: Fix controller unregister order
  spi: No need to assign dummy value in spi_unregister_controller()
  spi: dw: Fix controller unregister order
  spi: dw: fix possible race condition
  x86/speculation: PR_SPEC_FORCE_DISABLE enforcement for indirect branches.
  x86/speculation: Avoid force-disabling IBPB based on STIBP and enhanced IBRS.
  x86/speculation: Add support for STIBP always-on preferred mode
  x86/speculation: Change misspelled STIPB to STIBP
  KVM: x86: only do L1TF workaround on affected processors
  KVM: x86/mmu: Consolidate "is MMIO SPTE" code
  kvm: x86: Fix L1TF mitigation for shadow MMU
  ALSA: pcm: disallow linking stream to itself
  crypto: cavium/nitrox - Fix 'nitrox_get_first_device()' when ndevlist is fully iterated
  spi: bcm-qspi: when tx/rx buffer is NULL set to 0
  spi: bcm2835aux: Fix controller unregister order
  nilfs2: fix null pointer dereference at nilfs_segctor_do_construct()
  cgroup, blkcg: Prepare some symbols for module and !CONFIG_CGROUP usages
  ACPI: PM: Avoid using power resources if there are none for D0
  ACPI: GED: add support for _Exx / _Lxx handler methods
  ACPI: CPPC: Fix reference count leak in acpi_cppc_processor_probe()
  ACPI: sysfs: Fix reference count leak in acpi_sysfs_add_hotplug_profile()
  ALSA: usb-audio: Fix inconsistent card PM state after resume
  ALSA: hda/realtek - add a pintbl quirk for several Lenovo machines
  ALSA: es1688: Add the missed snd_card_free()
  efi/efivars: Add missing kobject_put() in sysfs entry creation error path
  x86/reboot/quirks: Add MacBook6,1 reboot quirk
  x86/speculation: Prevent rogue cross-process SSBD shutdown
  x86/PCI: Mark Intel C620 MROMs as having non-compliant BARs
  x86_64: Fix jiffies ODR violation
  mm: add kvfree_sensitive() for freeing sensitive data objects
  perf probe: Accept the instance number of kretprobe event
  ath9k_htc: Silence undersized packet warnings
  powerpc/xive: Clear the page tables for the ESB IO mapping
  drivers/net/ibmvnic: Update VNIC protocol version reporting
  Input: synaptics - add a second working PNP_ID for Lenovo T470s
  sched/fair: Don't NUMA balance for kthreads
  ARM: 8977/1: ptrace: Fix mask for thumb breakpoint hook
  crypto: talitos - fix ECB and CBC algs ivsize
  serial: imx: Fix handling of TC irq in combination with DMA
  lib: Reduce user_access_begin() boundaries in strncpy_from_user() and strnlen_user()
  x86: uaccess: Inhibit speculation past access_ok() in user_access_begin()
  arch/openrisc: Fix issues with access_ok()
  Fix 'acccess_ok()' on alpha and SH
  make 'user_access_begin()' do 'access_ok()'
  vxlan: Avoid infinite loop when suppressing NS messages with invalid options
  ipv6: fix IPV6_ADDRFORM operation logic
  writeback: Drop I_DIRTY_TIME_EXPIRE
  writeback: Fix sync livelock due to b_dirty_time processing
  writeback: Avoid skipping inode writeback
  writeback: Protect inode->i_io_list with inode->i_lock
  Revert "writeback: Avoid skipping inode writeback"
  ANDROID: Enable LZ4_RAMDISK
  fscrypt: remove stale definition
  fs-verity: remove unnecessary extern keywords
  fs-verity: fix all kerneldoc warnings
  fscrypt: add support for IV_INO_LBLK_32 policies
  fscrypt: make test_dummy_encryption use v2 by default
  fscrypt: support test_dummy_encryption=v2
  fscrypt: add fscrypt_add_test_dummy_key()
  linux/parser.h: add include guards
  fscrypt: remove unnecessary extern keywords
  fscrypt: name all function parameters
  fscrypt: fix all kerneldoc warnings
  ANDROID: kbuild: merge more sections with LTO
  Linux 4.14.184
  uprobes: ensure that uprobe->offset and ->ref_ctr_offset are properly aligned
  iio: vcnl4000: Fix i2c swapped word reading.
  x86/speculation: Add Ivy Bridge to affected list
  x86/speculation: Add SRBDS vulnerability and mitigation documentation
  x86/speculation: Add Special Register Buffer Data Sampling (SRBDS) mitigation
  x86/cpu: Add 'table' argument to cpu_matches()
  x86/cpu: Add a steppings field to struct x86_cpu_id
  nvmem: qfprom: remove incorrect write support
  CDC-ACM: heed quirk also in error handling
  staging: rtl8712: Fix IEEE80211_ADDBA_PARAM_BUF_SIZE_MASK
  tty: hvc_console, fix crashes on parallel open/close
  vt: keyboard: avoid signed integer overflow in k_ascii
  usb: musb: Fix runtime PM imbalance on error
  usb: musb: start session in resume for host port
  USB: serial: option: add Telit LE910C1-EUX compositions
  USB: serial: usb_wwan: do not resubmit rx urb on fatal errors
  USB: serial: qcserial: add DW5816e QDL support
  l2tp: add sk_family checks to l2tp_validate_socket
  net: check untrusted gso_size at kernel entry
  vsock: fix timeout in vsock_accept()
  NFC: st21nfca: add missed kfree_skb() in an error path
  net: usb: qmi_wwan: add Telit LE910C1-EUX composition
  l2tp: do not use inet_hash()/inet_unhash()
  devinet: fix memleak in inetdev_init()
  airo: Fix read overflows sending packets
  scsi: ufs: Release clock if DMA map fails
  mmc: fix compilation of user API
  kernel/relay.c: handle alloc_percpu returning NULL in relay_open
  p54usb: add AirVasT USB stick device-id
  HID: i2c-hid: add Schneider SCL142ALM to descriptor override
  HID: sony: Fix for broken buttons on DS3 USB dongles
  mm: Fix mremap not considering huge pmd devmap
  net: smsc911x: Fix runtime PM imbalance on error
  net: ethernet: stmmac: Enable interface clocks on probe for IPQ806x
  net/ethernet/freescale: rework quiesce/activate for ucc_geth
  net: bmac: Fix read of MAC address from ROM
  x86/mmiotrace: Use cpumask_available() for cpumask_var_t variables
  i2c: altera: Fix race between xfer_msg and isr thread
  ARC: [plat-eznps]: Restrict to CONFIG_ISA_ARCOMPACT
  ARC: Fix ICCM & DCCM runtime size checks
  pppoe: only process PADT targeted at local interfaces
  s390/ftrace: save traced function caller
  spi: dw: use "smp_mb()" to avoid sending spi data error
  scsi: hisi_sas: Check sas_port before using it
  libnvdimm: Fix endian conversion issues 
  scsi: scsi_devinfo: fixup string compare
  ANDROID: Incremental fs: Remove dependency on PKCS7_MESSAGE_PARSER
  f2fs: attach IO flags to the missing cases
  f2fs: add node_io_flag for bio flags likewise data_io_flag
  f2fs: remove unused parameter of f2fs_put_rpages_mapping()
  f2fs: handle readonly filesystem in f2fs_ioc_shutdown()
  f2fs: avoid utf8_strncasecmp() with unstable name
  f2fs: don't return vmalloc() memory from f2fs_kmalloc()
  ANDROID: dm-bow: Add block_size option
  ANDROID: Incremental fs: Cache successful hash calculations
  ANDROID: Incremental fs: Fix four error-path bugs
  ANDROID: cuttlefish_defconfig: Disable CMOS RTC driver
  f2fs: fix retry logic in f2fs_write_cache_pages()
  ANDROID: modules: fix lockprove warning
  BACKPORT: arm64: vdso: Explicitly add build-id option
  BACKPORT: arm64: vdso: use $(LD) instead of $(CC) to link VDSO
  Linux 4.14.183
  scsi: zfcp: fix request object use-after-free in send path causing wrong traces
  genirq/generic_pending: Do not lose pending affinity update
  net: hns: Fixes the missing put_device in positive leg for roce reset
  net: hns: fix unsigned comparison to less than zero
  KVM: VMX: check for existence of secondary exec controls before accessing
  rxrpc: Fix transport sockopts to get IPv4 errors on an IPv6 socket
  sc16is7xx: move label 'err_spi' to correct section
  mm/vmalloc.c: don't dereference possible NULL pointer in __vunmap()
  netfilter: nf_conntrack_pptp: fix compilation warning with W=1 build
  bonding: Fix reference count leak in bond_sysfs_slave_add.
  qlcnic: fix missing release in qlcnic_83xx_interrupt_test.
  esp6: get the right proto for transport mode in esp6_gso_encap
  netfilter: nf_conntrack_pptp: prevent buffer overflows in debug code
  netfilter: nfnetlink_cthelper: unbreak userspace helper support
  netfilter: ipset: Fix subcounter update skip
  netfilter: nft_reject_bridge: enable reject with bridge vlan
  ip_vti: receive ipip packet by calling ip_tunnel_rcv
  vti4: eliminated some duplicate code.
  xfrm: fix error in comment
  xfrm: fix a NULL-ptr deref in xfrm_local_error
  xfrm: fix a warning in xfrm_policy_insert_list
  xfrm: call xfrm_output_gso when inner_protocol is set in xfrm_output
  xfrm: allow to accept packets with ipv6 NEXTHDR_HOP in xfrm_input
  copy_xstate_to_kernel(): don't leave parts of destination uninitialized
  x86/dma: Fix max PFN arithmetic overflow on 32 bit systems
  mac80211: mesh: fix discovery timer re-arming issue / crash
  parisc: Fix kernel panic in mem_init()
  iommu: Fix reference count leak in iommu_group_alloc.
  include/asm-generic/topology.h: guard cpumask_of_node() macro argument
  fs/binfmt_elf.c: allocate initialized memory in fill_thread_core_info()
  mm: remove VM_BUG_ON(PageSlab()) from page_mapcount()
  libceph: ignore pool overlay and cache logic on redirects
  ALSA: hda/realtek - Add new codec supported for ALC287
  exec: Always set cap_ambient in cap_bprm_set_creds
  ALSA: usb-audio: mixer: volume quirk for ESS Technology Asus USB DAC
  ALSA: hwdep: fix a left shifting 1 by 31 UB bug
  RDMA/pvrdma: Fix missing pci disable in pvrdma_pci_probe()
  mmc: block: Fix use-after-free issue for rpmb
  ARM: dts: bcm2835-rpi-zero-w: Fix led polarity
  ARM: dts/imx6q-bx50v3: Set display interface clock parents
  ARM: dts: imx6q-bx50v3: Add internal switch
  IB/qib: Call kobject_put() when kobject_init_and_add() fails
  gpio: exar: Fix bad handling for ida_simple_get error path
  ARM: uaccess: fix DACR mismatch with nested exceptions
  ARM: uaccess: integrate uaccess_save and uaccess_restore
  ARM: uaccess: consolidate uaccess asm to asm/uaccess-asm.h
  ARM: 8843/1: use unified assembler in headers
  Input: synaptics-rmi4 - fix error return code in rmi_driver_probe()
  Input: synaptics-rmi4 - really fix attn_data use-after-free
  Input: i8042 - add ThinkPad S230u to i8042 reset list
  Input: dlink-dir685-touchkeys - fix a typo in driver name
  Input: xpad - add custom init packet for Xbox One S controllers
  Input: evdev - call input_flush_device() on release(), not flush()
  Input: usbtouchscreen - add support for BonXeon TP
  samples: bpf: Fix build error
  cifs: Fix null pointer check in cifs_read
  net: freescale: select CONFIG_FIXED_PHY where needed
  usb: gadget: legacy: fix redundant initialization warnings
  cachefiles: Fix race between read_waiter and read_copier involving op->to_do
  gfs2: move privileged user check to gfs2_quota_lock_check
  net: microchip: encx24j600: add missed kthread_stop
  gpio: tegra: mask GPIO IRQs during IRQ shutdown
  ARM: dts: rockchip: fix pinctrl sub nodename for spi in rk322x.dtsi
  arm64: dts: rockchip: swap interrupts interrupt-names rk3399 gpu node
  ARM: dts: rockchip: fix phy nodename for rk3228-evb
  net/mlx4_core: fix a memory leak bug.
  net: sun: fix missing release regions in cas_init_one().
  net: qrtr: Fix passing invalid reference to qrtr_local_enqueue()
  net/mlx5e: Update netdev txq on completions during closure
  sctp: Start shutdown on association restart if in SHUTDOWN-SENT state and socket is closed
  r8152: support additional Microsoft Surface Ethernet Adapter variant
  net sched: fix reporting the first-time use timestamp
  net: revert "net: get rid of an signed integer overflow in ip_idents_reserve()"
  net/mlx5: Add command entry handling completion
  net: ipip: fix wrong address family in init error path
  ax25: fix setsockopt(SO_BINDTODEVICE)
  ANDROID: scs: fix recursive spinlock in scs_check_usage
  ANDROID: timer: fix timer_setup with CFI
  FROMGIT: USB: dummy-hcd: use configurable endpoint naming scheme
  UPSTREAM: USB: dummy-hcd: remove unsupported isochronous endpoints
  UPSTREAM: usb: raw-gadget: fix null-ptr-deref when reenabling endpoints
  UPSTREAM: usb: raw-gadget: documentation updates
  UPSTREAM: usb: raw-gadget: support stalling/halting/wedging endpoints
  UPSTREAM: usb: raw-gadget: fix gadget endpoint selection
  UPSTREAM: usb: raw-gadget: improve uapi headers comments
  UPSTREAM: usb: raw-gadget: fix return value of ep read ioctls
  UPSTREAM: usb: raw-gadget: fix raw_event_queue_fetch locking
  UPSTREAM: usb: raw-gadget: Fix copy_to/from_user() checks
  f2fs: fix wrong discard space
  f2fs: compress: don't compress any datas after cp stop
  f2fs: remove unneeded return value of __insert_discard_tree()
  f2fs: fix wrong value of tracepoint parameter
  f2fs: protect new segment allocation in expand_inode_data
  f2fs: code cleanup by removing ifdef macro surrounding
  writeback: Avoid skipping inode writeback
  ANDROID: net: bpf: permit redirect from ingress L3 to egress L2 devices at near max mtu
  Revert "ANDROID: Incremental fs: Avoid continually recalculating hashes"
  Linux 4.14.182
  iio: adc: stm32-adc: fix device used to request dma
  iio: adc: stm32-adc: Use dma_request_chan() instead dma_request_slave_channel()
  x86/unwind/orc: Fix unwind_get_return_address_ptr() for inactive tasks
  rxrpc: Fix a memory leak in rxkad_verify_response()
  rapidio: fix an error in get_user_pages_fast() error handling
  mei: release me_cl object reference
  iio: dac: vf610: Fix an error handling path in 'vf610_dac_probe()'
  iio: sca3000: Remove an erroneous 'get_device()'
  staging: greybus: Fix uninitialized scalar variable
  staging: iio: ad2s1210: Fix SPI reading
  Revert "gfs2: Don't demote a glock until its revokes are written"
  cxgb4/cxgb4vf: Fix mac_hlist initialization and free
  cxgb4: free mac_hlist properly
  media: fdp1: Fix R-Car M3-N naming in debug message
  libnvdimm/btt: Fix LBA masking during 'free list' population
  libnvdimm/btt: Remove unnecessary code in btt_freelist_init
  ubsan: build ubsan.c more conservatively
  x86/uaccess, ubsan: Fix UBSAN vs. SMAP
  powerpc/64s: Disable STRICT_KERNEL_RWX
  powerpc: Remove STRICT_KERNEL_RWX incompatibility with RELOCATABLE
  powerpc: restore alphabetic order in Kconfig
  dmaengine: tegra210-adma: Fix an error handling path in 'tegra_adma_probe()'
  apparmor: Fix aa_label refcnt leak in policy_update
  ALSA: pcm: fix incorrect hw_base increase
  ALSA: iec1712: Initialize STDSP24 properly when using the model=staudio option
  l2tp: initialise PPP sessions before registering them
  l2tp: protect sock pointer of struct pppol2tp_session with RCU
  l2tp: initialise l2tp_eth sessions before registering them
  l2tp: don't register sessions in l2tp_session_create()
  arm64: fix the flush_icache_range arguments in machine_kexec
  padata: purge get_cpu and reorder_via_wq from padata_do_serial
  padata: initialize pd->cpu with effective cpumask
  padata: Replace delayed timer with immediate workqueue in padata_reorder
  padata: set cpu_index of unused CPUs to -1
  ARM: futex: Address build warning
  platform/x86: asus-nb-wmi: Do not load on Asus T100TA and T200TA
  USB: core: Fix misleading driver bug report
  ceph: fix double unlock in handle_cap_export()
  gtp: set NLM_F_MULTI flag in gtp_genl_dump_pdp()
  x86/apic: Move TSC deadline timer debug printk
  scsi: ibmvscsi: Fix WARN_ON during event pool release
  component: Silence bind error on -EPROBE_DEFER
  vhost/vsock: fix packet delivery order to monitoring devices
  configfs: fix config_item refcnt leak in configfs_rmdir()
  scsi: qla2xxx: Fix hang when issuing nvme disconnect-all in NPIV
  HID: multitouch: add eGalaxTouch P80H84 support
  gcc-common.h: Update for GCC 10
  ubi: Fix seq_file usage in detailed_erase_block_info debugfs file
  i2c: mux: demux-pinctrl: Fix an error handling path in 'i2c_demux_pinctrl_probe()'
  iommu/amd: Fix over-read of ACPI UID from IVRS table
  fix multiplication overflow in copy_fdtable()
  ima: Fix return value of ima_write_policy()
  evm: Check also if *tfm is an error pointer in init_desc()
  ima: Set file->f_mode instead of file->f_flags in ima_calc_file_hash()
  padata: ensure padata_do_serial() runs on the correct CPU
  padata: ensure the reorder timer callback runs on the correct CPU
  i2c: dev: Fix the race between the release of i2c_dev and cdev
  watchdog: Fix the race between the release of watchdog_core_data and cdev
  ext4: add cond_resched() to ext4_protect_reserved_inode
  ANDROID: scsi: ufs: Handle clocks when lrbp fails
  ANDROID: fscrypt: handle direct I/O with IV_INO_LBLK_32
  BACKPORT: FROMLIST: fscrypt: add support for IV_INO_LBLK_32 policies
  f2fs: avoid inifinite loop to wait for flushing node pages at cp_error
  ANDROID: namespace'ify tcp_default_init_rwnd implementation
  Linux 4.14.181
  Makefile: disallow data races on gcc-10 as well
  KVM: x86: Fix off-by-one error in kvm_vcpu_ioctl_x86_setup_mce
  ARM: dts: r8a7740: Add missing extal2 to CPG node
  ARM: dts: r8a73a4: Add missing CMT1 interrupts
  arm64: dts: rockchip: Rename dwc3 device nodes on rk3399 to make dtc happy
  arm64: dts: rockchip: Replace RK805 PMIC node name with "pmic" on rk3328 boards
  Revert "ALSA: hda/realtek: Fix pop noise on ALC225"
  usb: gadget: legacy: fix error return code in cdc_bind()
  usb: gadget: legacy: fix error return code in gncm_bind()
  usb: gadget: audio: Fix a missing error return value in audio_bind()
  usb: gadget: net2272: Fix a memory leak in an error handling path in 'net2272_plat_probe()'
  clk: rockchip: fix incorrect configuration of rk3228 aclk_gpu* clocks
  exec: Move would_dump into flush_old_exec
  x86/unwind/orc: Fix error handling in __unwind_start()
  usb: xhci: Fix NULL pointer dereference when enqueuing trbs from urb sg list
  USB: gadget: fix illegal array access in binding with UDC
  usb: host: xhci-plat: keep runtime active when removing host
  usb: core: hub: limit HUB_QUIRK_DISABLE_AUTOSUSPEND to USB5534B
  ALSA: usb-audio: Add control message quirk delay for Kingston HyperX headset
  x86: Fix early boot crash on gcc-10, third try
  ARM: dts: imx27-phytec-phycard-s-rdk: Fix the I2C1 pinctrl entries
  ARM: dts: dra7: Fix bus_dma_limit for PCIe
  ALSA: rawmidi: Fix racy buffer resize under concurrent accesses
  ALSA: rawmidi: Initialize allocated buffers
  ALSA: hda/realtek - Limit int mic boost for Thinkpad T530
  net: tcp: fix rx timestamp behavior for tcp_recvmsg
  netprio_cgroup: Fix unlimited memory leak of v2 cgroups
  net: ipv4: really enforce backoff for redirects
  net: dsa: loop: Add module soft dependency
  hinic: fix a bug of ndo_stop
  Revert "ipv6: add mtu lock check in __ip6_rt_update_pmtu"
  net: phy: fix aneg restart in phy_ethtool_set_eee
  netlabel: cope with NULL catmap
  net: fix a potential recursive NETDEV_FEAT_CHANGE
  net: phy: micrel: Use strlcpy() for ethtool::get_strings
  x86/asm: Add instruction suffixes to bitops
  gcc-10: avoid shadowing standard library 'free()' in crypto
  gcc-10: disable 'restrict' warning for now
  gcc-10: disable 'stringop-overflow' warning for now
  gcc-10: disable 'array-bounds' warning for now
  gcc-10: disable 'zero-length-bounds' warning for now
  Stop the ad-hoc games with -Wno-maybe-initialized
  kbuild: compute false-positive -Wmaybe-uninitialized cases in Kconfig
  gcc-10 warnings: fix low-hanging fruit
  pnp: Use list_for_each_entry() instead of open coding
  hwmon: (da9052) Synchronize access with mfd
  IB/mlx4: Test return value of calls to ib_get_cached_pkey
  netfilter: conntrack: avoid gcc-10 zero-length-bounds warning
  i40iw: Fix error handling in i40iw_manage_arp_cache()
  pinctrl: cherryview: Add missing spinlock usage in chv_gpio_irq_handler
  pinctrl: baytrail: Enable pin configuration setting for GPIO chip
  ipmi: Fix NULL pointer dereference in ssif_probe
  x86/entry/64: Fix unwind hints in register clearing code
  ALSA: hda/realtek - Fix S3 pop noise on Dell Wyse
  ipc/util.c: sysvipc_find_ipc() incorrectly updates position index
  drm/qxl: lost qxl_bo_kunmap_atomic_page in qxl_image_init_helper()
  ALSA: hda/hdmi: fix race in monitor detection during probe
  cpufreq: intel_pstate: Only mention the BIOS disabling turbo mode once
  dmaengine: mmp_tdma: Reset channel error on release
  dmaengine: pch_dma.c: Avoid data race between probe and irq handler
  scsi: sg: add sg_remove_request in sg_write
  virtio-blk: handle block_device_operations callbacks after hot unplug
  drop_monitor: work around gcc-10 stringop-overflow warning
  net: moxa: Fix a potential double 'free_irq()'
  net/sonic: Fix a resource leak in an error handling path in 'jazz_sonic_probe()'
  shmem: fix possible deadlocks on shmlock_user_lock
  net: stmmac: Use mutex instead of spinlock
  f2fs: fix to avoid memory leakage in f2fs_listxattr
  f2fs: fix to avoid accessing xattr across the boundary
  f2fs: sanity check of xattr entry size
  f2fs: introduce read_xattr_block
  f2fs: introduce read_inline_xattr
  blktrace: fix dereference after null check
  blktrace: Protect q->blk_trace with RCU
  blktrace: fix trace mutex deadlock
  blktrace: fix unlocked access to init/start-stop/teardown
  net: ipv6_stub: use ip6_dst_lookup_flow instead of ip6_dst_lookup
  net: ipv6: add net argument to ip6_dst_lookup_flow
  scripts/decodecode: fix trapping instruction formatting
  objtool: Fix stack offset tracking for indirect CFAs
  netfilter: nat: never update the UDP checksum when it's 0
  x86/unwind/orc: Fix error path for bad ORC entry type
  x86/unwind/orc: Prevent unwinding before ORC initialization
  x86/unwind/orc: Don't skip the first frame for inactive tasks
  x86/entry/64: Fix unwind hints in rewind_stack_do_exit()
  x86/entry/64: Fix unwind hints in kernel exit path
  batman-adv: Fix refcnt leak in batadv_v_ogm_process
  batman-adv: Fix refcnt leak in batadv_store_throughput_override
  batman-adv: Fix refcnt leak in batadv_show_throughput_override
  batman-adv: fix batadv_nc_random_weight_tq
  coredump: fix crash when umh is disabled
  mm/page_alloc: fix watchdog soft lockups during set_zone_contiguous()
  KVM: arm: vgic: Fix limit condition when writing to GICD_I[CS]ACTIVER
  tracing: Add a vmalloc_sync_mappings() for safe measure
  USB: serial: garmin_gps: add sanity checking for data length
  USB: uas: add quirk for LaCie 2Big Quadra
  HID: usbhid: Fix race between usbhid_close() and usbhid_stop()
  geneve: only configure or fill UDP_ZERO_CSUM6_RX/TX info when CONFIG_IPV6
  HID: wacom: Read HID_DG_CONTACTMAX directly for non-generic devices
  ipv6: fix cleanup ordering for ip6_mr failure
  net: stricter validation of untrusted gso packets
  bnxt_en: Fix VF anti-spoof filter setup.
  bnxt_en: Improve AER slot reset.
  net/mlx5: Fix command entry leak in Internal Error State
  net/mlx5: Fix forced completion access non initialized command entry
  bnxt_en: Fix VLAN acceleration handling in bnxt_fix_features().
  sch_sfq: validate silly quantum values
  sch_choke: avoid potential panic in choke_reset()
  net: usb: qmi_wwan: add support for DW5816e
  net/mlx4_core: Fix use of ENOSPC around mlx4_counter_alloc()
  net: macsec: preserve ingress frame ordering
  fq_codel: fix TCA_FQ_CODEL_DROP_BATCH_SIZE sanity checks
  dp83640: reverse arguments to list_add_tail
  USB: serial: qcserial: Add DW5816e support
  f2fs: compress: fix zstd data corruption
  f2fs: add compressed/gc data read IO stat
  f2fs: fix potential use-after-free issue
  f2fs: compress: don't handle non-compressed data in workqueue
  f2fs: remove redundant assignment to variable err
  f2fs: refactor resize_fs to avoid meta updates in progress
  f2fs: use round_up to enhance calculation
  f2fs: introduce F2FS_IOC_RESERVE_COMPRESS_BLOCKS
  f2fs: Avoid double lock for cp_rwsem during checkpoint
  f2fs: report delalloc reserve as non-free in statfs for project quota
  f2fs: Fix wrong stub helper update_sit_info
  f2fs: compress: let lz4 compressor handle output buffer budget properly
  f2fs: remove blk_plugging in block_operations
  f2fs: introduce F2FS_IOC_RELEASE_COMPRESS_BLOCKS
  f2fs: shrink spinlock coverage
  f2fs: correctly fix the parent inode number during fsync()
  f2fs: introduce mempool for {,de}compress intermediate page allocation
  f2fs: introduce f2fs_bmap_compress()
  f2fs: support fiemap on compressed inode
  f2fs: support partial truncation on compressed inode
  f2fs: remove redundant compress inode check
  f2fs: flush dirty meta pages when flushing them
  f2fs: use strcmp() in parse_options()
  f2fs: fix checkpoint=disable:%u%%
  f2fs: Use the correct style for SPDX License Identifier
  f2fs: rework filename handling
  f2fs: split f2fs_d_compare() from f2fs_match_name()
  f2fs: don't leak filename in f2fs_try_convert_inline_dir()
  ANDROID: clang: update to 11.0.1
  FROMLIST: x86_64: fix jiffies ODR violation
  ANDROID: cuttlefish_defconfig: Enable net testing options
  ANDROID: Incremental fs: wake up log pollers less often
  ANDROID: Incremental fs: Fix scheduling while atomic error
  ANDROID: Incremental fs: Avoid continually recalculating hashes
  Revert "f2fs: refactor resize_fs to avoid meta updates in progress"
  UPSTREAM: HID: steam: Fix input device disappearing
  ANDROID: fscrypt: set dun_bytes more precisely
  ANDROID: dm-default-key: set dun_bytes more precisely
  ANDROID: block: backport the ability to specify max_dun_bytes
  ANDROID: hid: steam: remove BT controller matching
  ANDROID: dm-default-key: Update key size for wrapped keys
  ANDROID: cuttlefish_defconfig: Enable CONFIG_STATIC_USERMODEHELPER
  ANDROID: cuttlefish_defconfig: enable CONFIG_MMC_CRYPTO
  ANDROID: Add padding for crypto related structs in UFS and MMC
  ANDROID: mmc: MMC crypto API
  f2fs: fix missing check for f2fs_unlock_op
  f2fs: refactor resize_fs to avoid meta updates in progress

 Conflicts:
	Documentation/devicetree/bindings/usb/dwc3.txt
	drivers/block/virtio_blk.c
	drivers/mmc/core/Kconfig
	drivers/mmc/core/block.c
	drivers/mmc/host/sdhci-msm.c
	drivers/net/ethernet/stmicro/stmmac/stmmac.h
	drivers/net/ethernet/stmicro/stmmac/stmmac_ethtool.c
	drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
	drivers/scsi/ufs/ufs-qcom.c
	drivers/usb/gadget/composite.c
	drivers/usb/gadget/function/f_uac1_legacy.c
	fs/crypto/crypto.c
	fs/crypto/inline_crypt.c
	fs/crypto/keyring.c
	fs/f2fs/checkpoint.c
	include/linux/fs.h
	include/linux/mmc/host.h
	include/linux/mod_devicetable.h
	include/uapi/linux/input-event-codes.h
	net/qrtr/qrtr.c
	sound/core/compress_offload.c
	sound/core/rawmidi.c

Fixed build errors:
	drivers/scsi/ufs/ufshcd.c

Change-Id: I2add911b58d3c87b666ffa0fe46cbceb6cc56430
Signed-off-by: Srinivasarao P <spathi@codeaurora.org>
2020-09-06 01:12:33 +05:30
Greg Kroah-Hartman
fa6adf6dea This is the 4.14.188 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl8GyQEACgkQONu9yGCS
 aT6M6A/+KZUxvf9eqthOw08YylvhSr5LpWestSWlvBB7KVJaqSW1aKi/qCGXr+zV
 5Go3J5089Zu8M2nKn/1bq2AvfugLa0+xYtCy/NkN8T/4ozB/ar/ZLMGh4LKbiDeZ
 Pz5pmZEajdmwNyfxqfkg5GgOfm1JTAjicnuxJZHa86L5JqkV0axC7/Crh+RPnJqc
 1QiHMdrzrcYkms53EPhRIOfBuS63Y3GeNse3LW8AQpzWRME0O6qHQqD1lmj3126Y
 ad7VeK4s0VDHQVz0opBhyJ3yC+SSKDvjjt+78qJc7WzZJBH3YqMzTqqgU2LVkuUW
 ZM9YYtooyg+Nax8+c1+jEOyvyCybsd0rAGhQzzX6jBoPL07tnvbOG5zEI73lW0R7
 +E3sIIgGlx8IJbHEk7Xiidkh5Eo0bBpFaLM7auGlmklYdWOjaL4rnF4QH+u/UP93
 oBJoOhD1vBHmptYGr45m6oUlP0IMfxRh50jJlp3tUgOjj/BPNm9iq0OsdC2doCXd
 YOPiiw+q7cpxY1/XgdJH7RF04r4HZOj/t6JZ3Bd7odR7ycbO2VNSex0W3mJfIWXX
 T78PFcFThyXXHVBjyTPgOLFwCLPS/rKVB6018W9c3ztNb5Gq1IaXTi+YHm4Jfpb1
 qjb5ZtRSQREmL80pPt8NqL1BoNvakSN50srYKk51KYmXZQ+YdWg=
 =5Z2k
 -----END PGP SIGNATURE-----

Merge 4.14.188 into android-4.14-stable

Changes in 4.14.188
	btrfs: fix a block group ref counter leak after failure to remove block group
	btrfs: cow_file_range() num_bytes and disk_num_bytes are same
	btrfs: fix data block group relocation failure due to concurrent scrub
	mm: fix swap cache node allocation mask
	EDAC/amd64: Read back the scrub rate PCI register on F15h
	usbnet: smsc95xx: Fix use-after-free after removal
	mm/slub.c: fix corrupted freechain in deactivate_slab()
	mm/slub: fix stack overruns with SLUB_STATS
	usb: usbtest: fix missing kfree(dev->buf) in usbtest_disconnect
	kgdb: Avoid suspicious RCU usage warning
	crypto: af_alg - fix use-after-free in af_alg_accept() due to bh_lock_sock()
	cxgb4: use unaligned conversion for fetching timestamp
	cxgb4: parse TC-U32 key values and masks natively
	hwmon: (max6697) Make sure the OVERT mask is set correctly
	hwmon: (acpi_power_meter) Fix potential memory leak in acpi_power_meter_add()
	drm: sun4i: hdmi: Remove extra HPD polling
	virtio-blk: free vblk-vqs in error path of virtblk_probe()
	i2c: algo-pca: Add 0x78 as SCL stuck low status for PCA9665
	nfsd: apply umask on fs without ACL support
	Revert "ALSA: usb-audio: Improve frames size computation"
	SMB3: Honor 'seal' flag for multiuser mounts
	SMB3: Honor persistent/resilient handle flags for multiuser mounts
	cifs: Fix the target file was deleted when rename failed.
	MIPS: Add missing EHB in mtc0 -> mfc0 sequence for DSPen
	irqchip/gic: Atomically update affinity
	dm zoned: assign max_io_len correctly
	efi: Make it possible to disable efivar_ssdt entirely
	Linux 4.14.188

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I048e9500ecbd19082316f125d67d22cc48bc07f2
2020-07-09 11:15:36 +02:00
Hugh Dickins
eba151dc93 mm: fix swap cache node allocation mask
[ Upstream commit 243bce09c91b0145aeaedd5afba799d81841c030 ]

Chris Murphy reports that a slightly overcommitted load, testing swap
and zram along with i915, splats and keeps on splatting, when it had
better fail less noisily:

  gnome-shell: page allocation failure: order:0,
  mode:0x400d0(__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_RECLAIMABLE),
  nodemask=(null),cpuset=/,mems_allowed=0
  CPU: 2 PID: 1155 Comm: gnome-shell Not tainted 5.7.0-1.fc33.x86_64 #1
  Call Trace:
    dump_stack+0x64/0x88
    warn_alloc.cold+0x75/0xd9
    __alloc_pages_slowpath.constprop.0+0xcfa/0xd30
    __alloc_pages_nodemask+0x2df/0x320
    alloc_slab_page+0x195/0x310
    allocate_slab+0x3c5/0x440
    ___slab_alloc+0x40c/0x5f0
    __slab_alloc+0x1c/0x30
    kmem_cache_alloc+0x20e/0x220
    xas_nomem+0x28/0x70
    add_to_swap_cache+0x321/0x400
    __read_swap_cache_async+0x105/0x240
    swap_cluster_readahead+0x22c/0x2e0
    shmem_swapin+0x8e/0xc0
    shmem_swapin_page+0x196/0x740
    shmem_getpage_gfp+0x3a2/0xa60
    shmem_read_mapping_page_gfp+0x32/0x60
    shmem_get_pages+0x155/0x5e0 [i915]
    __i915_gem_object_get_pages+0x68/0xa0 [i915]
    i915_vma_pin+0x3fe/0x6c0 [i915]
    eb_add_vma+0x10b/0x2c0 [i915]
    i915_gem_do_execbuffer+0x704/0x3430 [i915]
    i915_gem_execbuffer2_ioctl+0x1ea/0x3e0 [i915]
    drm_ioctl_kernel+0x86/0xd0 [drm]
    drm_ioctl+0x206/0x390 [drm]
    ksys_ioctl+0x82/0xc0
    __x64_sys_ioctl+0x16/0x20
    do_syscall_64+0x5b/0xf0
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reported on 5.7, but it goes back really to 3.1: when
shmem_read_mapping_page_gfp() was implemented for use by i915, and
allowed for __GFP_NORETRY and __GFP_NOWARN flags in most places, but
missed swapin's "& GFP_KERNEL" mask for page tree node allocation in
__read_swap_cache_async() - that was to mask off HIGHUSER_MOVABLE bits
from what page cache uses, but GFP_RECLAIM_MASK is now what's needed.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=208085
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2006151330070.11064@eggly.anvils
Fixes: 68da9f055755 ("tmpfs: pass gfp to shmem_getpage_gfp")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reported-by: Chris Murphy <lists@colorremedies.com>
Analyzed-by: Vlastimil Babka <vbabka@suse.cz>
Analyzed-by: Matthew Wilcox <willy@infradead.org>
Tested-by: Chris Murphy <lists@colorremedies.com>
Cc: <stable@vger.kernel.org>	[3.1+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-07-09 09:36:30 +02:00
Blagovest Kolenichev
070370f0ae Merge android-4.14.108 (4344de2) into msm-4.14
* refs/heads/tmp-4344de2:
  Linux 4.14.108
  s390/setup: fix boot crash for machine without EDAT-1
  KVM: nVMX: Ignore limit checks on VMX instructions using flat segments
  KVM: nVMX: Apply addr size mask to effective address for VMX instructions
  KVM: nVMX: Sign extend displacements of VMX instr's mem operands
  KVM: x86/mmu: Do not cache MMIO accesses while memslots are in flux
  KVM: x86/mmu: Detect MMIO generation wrap in any address space
  KVM: Call kvm_arch_memslots_updated() before updating memslots
  drm/radeon/evergreen_cs: fix missing break in switch statement
  media: imx: csi: Stop upstream before disabling IDMA channel
  media: imx: csi: Disable CSI immediately after last EOF
  media: vimc: Add vimc-streamer for stream control
  media: uvcvideo: Avoid NULL pointer dereference at the end of streaming
  media: imx: prpencvf: Stop upstream before disabling IDMA channel
  rcu: Do RCU GP kthread self-wakeup from softirq and interrupt
  tpm: Unify the send callback behaviour
  tpm/tpm_crb: Avoid unaligned reads in crb_recv()
  md: Fix failed allocation of md_register_thread
  perf intel-pt: Fix divide by zero when TSC is not available
  perf intel-pt: Fix overlap calculation for padding
  perf auxtrace: Define auxtrace record alignment
  perf intel-pt: Fix CYC timestamp calculation after OVF
  x86/unwind/orc: Fix ORC unwind table alignment
  bcache: never writeback a discard operation
  PM / wakeup: Rework wakeup source timer cancellation
  NFSv4.1: Reinitialise sequence results before retransmitting a request
  nfsd: fix wrong check in write_v4_end_grace()
  nfsd: fix memory corruption caused by readdir
  NFS: Don't recoalesce on error in nfs_pageio_complete_mirror()
  NFS: Fix an I/O request leakage in nfs_do_recoalesce
  NFS: Fix I/O request leakages
  cpcap-charger: generate events for userspace
  dm integrity: limit the rate of error messages
  dm: fix to_sector() for 32bit
  arm64: KVM: Fix architecturally invalid reset value for FPEXC32_EL2
  arm64: debug: Ensure debug handlers check triggering exception level
  arm64: Fix HCR.TGE status for NMI contexts
  ARM: s3c24xx: Fix boolean expressions in osiris_dvs_notify
  powerpc/traps: Fix the message printed when stack overflows
  powerpc/traps: fix recoverability of machine check handling on book3s/32
  powerpc/hugetlb: Don't do runtime allocation of 16G pages in LPAR configuration
  powerpc/ptrace: Simplify vr_get/set() to avoid GCC warning
  powerpc: Fix 32-bit KVM-PR lockup and host crash with MacOS guest
  powerpc/83xx: Also save/restore SPRG4-7 during suspend
  powerpc/powernv: Make opal log only readable by root
  powerpc/wii: properly disable use of BATs when requested.
  powerpc/32: Clear on-stack exception marker upon exception return
  security/selinux: fix SECURITY_LSM_NATIVE_LABELS on reused superblock
  jbd2: fix compile warning when using JBUFFER_TRACE
  jbd2: clear dirty flag when revoking a buffer from an older transaction
  serial: 8250_pci: Have ACCES cards that use the four port Pericom PI7C9X7954 chip use the pci_pericom_setup()
  serial: 8250_pci: Fix number of ports for ACCES serial cards
  serial: 8250_of: assume reg-shift of 2 for mrvl,mmp-uart
  serial: uartps: Fix stuck ISR if RX disabled with non-empty FIFO
  drm/i915: Relax mmap VMA check
  crypto: arm64/aes-neonbs - fix returning final keystream block
  i2c: tegra: fix maximum transfer size
  parport_pc: fix find_superio io compare code, should use equal test.
  intel_th: Don't reference unassigned outputs
  device property: Fix the length used in PROPERTY_ENTRY_STRING()
  kernel/sysctl.c: add missing range check in do_proc_dointvec_minmax_conv
  mm/vmalloc: fix size check for remap_vmalloc_range_partial()
  mm: hwpoison: fix thp split handing in soft_offline_in_use_page()
  nfit: acpi_nfit_ctl(): Check out_obj->type in the right place
  usb: chipidea: tegra: Fix missed ci_hdrc_remove_device()
  clk: ingenic: Fix doc of ingenic_cgu_div_info
  clk: ingenic: Fix round_rate misbehaving with non-integer dividers
  clk: clk-twl6040: Fix imprecise external abort for pdmclk
  clk: uniphier: Fix update register for CPU-gear
  ext2: Fix underflow in ext2_max_size()
  cxl: Wrap iterations over afu slices inside 'afu_list_lock'
  IB/hfi1: Close race condition on user context disable and close
  ext4: fix crash during online resizing
  ext4: add mask of ext4 flags to swap
  cpufreq: pxa2xx: remove incorrect __init annotation
  cpufreq: tegra124: add missing of_node_put()
  x86/kprobes: Prohibit probing on optprobe template code
  irqchip/gic-v3-its: Avoid parsing _indirect_ twice for Device table
  libertas_tf: don't set URB_ZERO_PACKET on IN USB transfer
  crypto: pcbc - remove bogus memcpy()s with src == dest
  Btrfs: fix corruption reading shared and compressed extents after hole punching
  btrfs: ensure that a DUP or RAID1 block group has exactly two stripes
  Btrfs: setup a nofs context for memory allocation at __btrfs_set_acl
  m68k: Add -ffreestanding to CFLAGS
  splice: don't merge into linked buffers
  fs/devpts: always delete dcache dentry-s in dput()
  scsi: target/iscsi: Avoid iscsit_release_commands_from_conn() deadlock
  scsi: sd: Optimal I/O size should be a multiple of physical block size
  scsi: aacraid: Fix performance issue on logical drives
  scsi: virtio_scsi: don't send sc payload with tmfs
  s390/virtio: handle find on invalid queue gracefully
  s390/setup: fix early warning messages
  clocksource/drivers/exynos_mct: Clear timer interrupt when shutdown
  clocksource/drivers/exynos_mct: Move one-shot check from tick clear to ISR
  regulator: s2mpa01: Fix step values for some LDOs
  regulator: max77620: Initialize values for DT properties
  regulator: s2mps11: Fix steps for buck7, buck8 and LDO35
  spi: pxa2xx: Setup maximum supported DMA transfer length
  spi: ti-qspi: Fix mmap read when more than one CS in use
  mmc: sdhci-esdhc-imx: fix HS400 timing issue
  ACPI / device_sysfs: Avoid OF modalias creation for removed device
  xen: fix dom0 boot on huge systems
  tracing: Do not free iter->trace in fail path of tracing_open_pipe()
  tracing: Use strncpy instead of memcpy for string keys in hist triggers
  CIFS: Fix read after write for files with read caching
  CIFS: Do not reset lease state to NONE on lease break
  crypto: arm64/aes-ccm - fix bugs in non-NEON fallback routine
  crypto: arm64/aes-ccm - fix logical bug in AAD MAC handling
  crypto: testmgr - skip crc32c context test for ahash algorithms
  crypto: hash - set CRYPTO_TFM_NEED_KEY if ->setkey() fails
  crypto: arm64/crct10dif - revert to C code for short inputs
  crypto: arm/crct10dif - revert to C code for short inputs
  fix cgroup_do_mount() handling of failure exits
  libnvdimm: Fix altmap reservation size calculation
  libnvdimm/pmem: Honor force_raw for legacy pmem regions
  libnvdimm, pfn: Fix over-trim in trim_pfn_device()
  libnvdimm/label: Clear 'updating' flag after label-set update
  stm class: Prevent division by zero
  media: videobuf2-v4l2: drop WARN_ON in vb2_warn_zero_bytesused()
  tmpfs: fix uninitialized return value in shmem_link
  net: set static variable an initial value in atl2_probe()
  nfp: bpf: fix ALU32 high bits clearance bug
  nfp: bpf: fix code-gen bug on BPF_ALU | BPF_XOR | BPF_K
  net: thunderx: make CFG_DONE message to run through generic send-ack sequence
  mac80211_hwsim: propagate genlmsg_reply return code
  phonet: fix building with clang
  ARCv2: support manual regfile save on interrupts
  ARC: uacces: remove lp_start, lp_end from clobber list
  ARCv2: lib: memcpy: fix doing prefetchw outside of buffer
  ixgbe: fix older devices that do not support IXGBE_MRQC_L3L4TXSWEN
  tmpfs: fix link accounting when a tmpfile is linked in
  net: marvell: mvneta: fix DMA debug warning
  arm64: Relax GIC version check during early boot
  qed: Fix iWARP syn packet mac address validation.
  ASoC: topology: free created components in tplg load error
  mailbox: bcm-flexrm-mailbox: Fix FlexRM ring flush timeout issue
  net: mv643xx_eth: disable clk on error path in mv643xx_eth_shared_probe()
  qmi_wwan: apply SET_DTR quirk to Sierra WP7607
  pinctrl: meson: meson8b: fix the sdxc_a data 1..3 pins
  net: systemport: Fix reception of BPDUs
  scsi: libiscsi: Fix race between iscsi_xmit_task and iscsi_complete_task
  keys: Fix dependency loop between construction record and auth key
  assoc_array: Fix shortcut creation
  af_key: unconditionally clone on broadcast
  ARM: 8824/1: fix a migrating irq bug when hotplug cpu
  esp: Skip TX bytes accounting when sending from a request socket
  clk: sunxi: A31: Fix wrong AHB gate number
  clk: sunxi-ng: v3s: Fix TCON reset de-assert bit
  Input: st-keyscan - fix potential zalloc NULL dereference
  auxdisplay: ht16k33: fix potential user-after-free on module unload
  i2c: bcm2835: Clear current buffer pointers and counts after a transfer
  i2c: cadence: Fix the hold bit setting
  net: hns: Fix object reference leaks in hns_dsaf_roce_reset()
  mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs
  Revert "mm: use early_pfn_to_nid in page_ext_init"
  mm/gup: fix gup_pmd_range() for dax
  NFS: Don't use page_file_mapping after removing the page
  floppy: check_events callback should not return a negative number
  ipvs: fix dependency on nf_defrag_ipv6
  mac80211: Fix Tx aggregation session tear down with ITXQs
  Input: matrix_keypad - use flush_delayed_work()
  Input: ps2-gpio - flush TX work when closing port
  Input: cap11xx - switch to using set_brightness_blocking()
  ARM: OMAP2+: fix lack of timer interrupts on CPU1 after hotplug
  KVM: arm/arm64: Reset the VCPU without preemption and vcpu state loaded
  ASoC: rsnd: fixup rsnd_ssi_master_clk_start() user count check
  ASoC: dapm: fix out-of-bounds accesses to DAPM lookup tables
  ARM: OMAP2+: Variable "reg" in function omap4_dsi_mux_pads() could be uninitialized
  Input: pwm-vibra - stop regulator after disabling pwm, not before
  Input: pwm-vibra - prevent unbalanced regulator
  s390/dasd: fix using offset into zero size array error
  gpu: ipu-v3: Fix CSI offsets for imx53
  drm/imx: imx-ldb: add missing of_node_puts
  gpu: ipu-v3: Fix i.MX51 CSI control registers offset
  drm/imx: ignore plane updates on disabled crtcs
  crypto: rockchip - update new iv to device in multiple operations
  crypto: rockchip - fix scatterlist nents error
  crypto: ahash - fix another early termination in hash walk
  crypto: caam - fixed handling of sg list
  stm class: Fix an endless loop in channel allocation
  iio: adc: exynos-adc: Fix NULL pointer exception on unbind
  ASoC: fsl_esai: fix register setting issue in RIGHT_J mode
  9p/net: fix memory leak in p9_client_create
  9p: use inode->i_lock to protect i_size_write() under 32-bit
  FROMLIST: psi: introduce psi monitor
  FROMLIST: refactor header includes to allow kthread.h inclusion in psi_types.h
  FROMLIST: psi: track changed states
  FROMLIST: psi: split update_stats into parts
  FROMLIST: psi: rename psi fields in preparation for psi trigger addition
  FROMLIST: psi: make psi_enable static
  FROMLIST: psi: introduce state_mask to represent stalled psi states
  ANDROID: cuttlefish_defconfig: Enable CONFIG_INPUT_MOUSEDEV
  ANDROID: cuttlefish_defconfig: Enable CONFIG_PSI
  BACKPORT: kernel: cgroup: add poll file operation
  BACKPORT: fs: kernfs: add poll file operation
  UPSTREAM: psi: avoid divide-by-zero crash inside virtual machines
  UPSTREAM: psi: clarify the Kconfig text for the default-disable option
  UPSTREAM: psi: fix aggregation idle shut-off
  UPSTREAM: psi: fix reference to kernel commandline enable
  UPSTREAM: psi: make disabling/enabling easier for vendor kernels
  UPSTREAM: kernel/sched/psi.c: simplify cgroup_move_task()
  BACKPORT: psi: cgroup support
  UPSTREAM: psi: pressure stall information for CPU, memory, and IO
  UPSTREAM: sched: introduce this_rq_lock_irq()
  UPSTREAM: sched: sched.h: make rq locking and clock functions available in stats.h
  UPSTREAM: sched: loadavg: make calc_load_n() public
  BACKPORT: sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOAD
  UPSTREAM: delayacct: track delays from thrashing cache pages
  UPSTREAM: mm: workingset: tell cache transitions from workingset thrashing
  sched/fair: fix energy compute when a cluster is only a cpu core in multi-cluster system

Conflicts:
	arch/arm/kernel/irq.c
	drivers/scsi/sd.c
	include/linux/sched.h
	include/uapi/linux/taskstats.h
	kernel/sched/Makefile
	sound/soc/soc-dapm.c

Change-Id: I12ebb57a34da9101ee19458d7e1f96ecc769c39a
Signed-off-by: Blagovest Kolenichev <bkolenichev@codeaurora.org>
2019-05-15 07:44:57 -07:00
Johannes Weiner
1d54f31271 UPSTREAM: mm: workingset: tell cache transitions from workingset thrashing
Refaults happen during transitions between workingsets as well as in-place
thrashing.  Knowing the difference between the two has a range of
applications, including measuring the impact of memory shortage on the
system performance, as well as the ability to smarter balance pressure
between the filesystem cache and the swap-backed workingset.

During workingset transitions, inactive cache refaults and pushes out
established active cache.  When that active cache isn't stale, however,
and also ends up refaulting, that's bonafide thrashing.

Introduce a new page flag that tells on eviction whether the page has been
active or not in its lifetime.  This bit is then stored in the shadow
entry, to classify refaults as transitioning or thrashing.

How many page->flags does this leave us with on 32-bit?

	20 bits are always page flags

	21 if you have an MMU

	23 with the zone bits for DMA, Normal, HighMem, Movable

	29 with the sparsemem section bits

	30 if PAE is enabled

	31 with this patch.

So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes.  If
that's not enough, the system can switch to discontigmem and re-gain the 6
or 7 sparsemem section bits.

Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit 8508cf3ffad4defa202b303e5b6379efc4cd9054)

Bug: 127712811
Test: lmkd in PSI mode
Change-Id: I71df060dce5590a3c654f9a0e8e54deeb74b64c2
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-03-21 16:19:55 -07:00
Laurent Dufour
4153099359 mm: protect VMA modifications using VMA sequence count
The VMA sequence count has been introduced to allow fast detection of
VMA modification when running a page fault handler without holding
the mmap_sem.

This patch provides protection against the VMA modification done in :
	- madvise()
	- mpol_rebind_policy()
	- vma_replace_policy()
	- change_prot_numa()
	- mlock(), munlock()
	- mprotect()
	- mmap_region()
	- collapse_huge_page()
	- userfaultd registering services

In addition, VMA fields which will be read during the speculative fault
path needs to be written using WRITE_ONCE to prevent write to be split
and intermediate values to be pushed to other CPUs.

Change-Id: Ic36046b7254e538b6baf7144c50ae577ee7f2074
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Patch-mainline: linux-mm @ Tue, 17 Apr 2018 16:33:15
[vinmenon@codeaurora.org: trivial merge conflict fixes]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
2018-06-04 15:40:51 +05:30
Minchan Kim
88484e159f mm: swap: unify cluster-based and vma-based swap readahead
This patch makes do_swap_page() not need to be aware of two different swap
readahead algorithms.  Just unify cluster-based and vma-based readahead
function call.

Change-Id: I45eeedb6347c245fbdb38744ac484119ade07d9c
Link: http://lkml.kernel.org/r/1509520520-32367-3-git-send-email-minchan@kernel.org
Link: http://lkml.kernel.org/r/20180220085249.151400-3-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Git-commit: 9d1b4ad90ada396540a10ab5691671aa4c229a1d
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
2018-05-22 11:24:18 +05:30
Changbin Du
aebacdaceb mm/swap_state.c: declare a few variables as __read_mostly
These global variables are only set during initialization or rarely
change, so declare them as __read_mostly.

Change-Id: Iefa78448381eae5bb0e08eb7a5e0e61cc251a4b3
Link: http://lkml.kernel.org/r/1507802349-5554-1-git-send-email-changbin.du@intel.com
Signed-off-by: Changbin Du <changbin.du@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-commit: 783cb68ee2d25d621326366c0b615bf2ccf3b402
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
2018-05-22 11:24:17 +05:30
Minchan Kim
909b06426e mm: swap: clean up swap readahead
When I see recent change of swap readahead, I am very unhappy about
current code structure which diverges two swap readahead algorithm in
do_swap_page.  This patch is to clean it up.

Main motivation is that fault handler doesn't need to be aware of
readahead algorithms but just should call swapin_readahead.

As first step, this patch cleans up a little bit but not perfect (I just
separate for review easier) so next patch will make the goal complete.

Change-Id: Ia5420f4deb5f02e2f3e286ee2d9008a0cff4eeb4
Link: http://lkml.kernel.org/r/1509520520-32367-2-git-send-email-minchan@kernel.org
Link: http://lkml.kernel.org/r/20180220085249.151400-2-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Git-commit: f80207727aaca3aa34a9cd80659393534de69cad
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
2018-05-22 11:24:17 +05:30
Greg Kroah-Hartman
b24413180f License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier.  The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
 - file had no licensing information it it.
 - file was a */uapi/* one with no licensing information in it,
 - file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne.  Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed.  Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
 - Files considered eligible had to be source code files.
 - Make and config files were included as candidates if they contained >5
   lines of source
 - File already had some variant of a license header in it (even if <5
   lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

 - when both scanners couldn't find any license traces, file was
   considered to have no license information in it, and the top level
   COPYING file license applied.

   For non */uapi/* files that summary was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0                                              11139

   and resulted in the first patch in this series.

   If that file was a */uapi/* path one, it was "GPL-2.0 WITH
   Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                        930

   and resulted in the second patch in this series.

 - if a file had some form of licensing information in it, and was one
   of the */uapi/* ones, it was denoted with the Linux-syscall-note if
   any GPL family license was found in the file or had no licensing in
   it (per prior point).  Results summary:

   SPDX license identifier                            # files
   ---------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                       270
   GPL-2.0+ WITH Linux-syscall-note                      169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
   LGPL-2.1+ WITH Linux-syscall-note                      15
   GPL-1.0+ WITH Linux-syscall-note                       14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
   LGPL-2.0+ WITH Linux-syscall-note                       4
   LGPL-2.1 WITH Linux-syscall-note                        3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

   and that resulted in the third patch in this series.

 - when the two scanners agreed on the detected license(s), that became
   the concluded license(s).

 - when there was disagreement between the two scanners (one detected a
   license but the other didn't, or they both detected different
   licenses) a manual inspection of the file occurred.

 - In most cases a manual inspection of the information in the file
   resulted in a clear resolution of the license that should apply (and
   which scanner probably needed to revisit its heuristics).

 - When it was not immediately clear, the license identifier was
   confirmed with lawyers working with the Linux Foundation.

 - If there was any question as to the appropriate license identifier,
   the file was flagged for further research and to be revisited later
   in time.

In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights.  The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
 - a full scancode scan run, collecting the matched texts, detected
   license ids and scores
 - reviewing anything where there was a license detected (about 500+
   files) to ensure that the applied SPDX license was correct
 - reviewing anything where there was no detection but the patch license
   was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
   SPDX license was correct

This produced a worksheet with 20 files needing minor correction.  This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg.  Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected.  This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.)  Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-02 11:10:55 +01:00
Huang Ying
61b639723b mm, swap: use page-cluster as max window of VMA based swap readahead
When the VMA based swap readahead was introduced, a new knob

  /sys/kernel/mm/swap/vma_ra_max_order

was added as the max window of VMA swap readahead.  This is to make it
possible to use different max window for VMA based readahead and
original physical readahead.  But Minchan Kim pointed out that this will
cause a regression because setting page-cluster sysctl to zero cannot
disable swap readahead with the change.

To fix the regression, the page-cluster sysctl is used as the max window
of both the VMA based swap readahead and original physical swap
readahead.  If more fine grained control is needed in the future, more
knobs can be added as the subordinate knobs of the page-cluster sysctl.

The vma_ra_max_order knob is deleted.  Because the knob was introduced
in v4.14-rc1, and this patch is targeting being merged before v4.14
releasing, there should be no existing users of this newly added ABI.

Link: http://lkml.kernel.org/r/20171011070847.16003-1-ying.huang@intel.com
Fixes: ec560175c0b6fce ("mm, swap: VMA based swap readahead")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reported-by: Minchan Kim <minchan@kernel.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-13 16:18:33 -07:00
Shaohua Li
9625456cc7 mm: fix data corruption caused by lazyfree page
MADV_FREE clears pte dirty bit and then marks the page lazyfree (clear
SwapBacked).  There is no lock to prevent the page is added to swap
cache between these two steps by page reclaim.  If page reclaim finds
such page, it will simply add the page to swap cache without pageout the
page to swap because the page is marked as clean.  Next time, page fault
will read data from the swap slot which doesn't have the original data,
so we have a data corruption.  To fix issue, we mark the page dirty and
pageout the page.

However, we shouldn't dirty all pages which is clean and in swap cache.
swapin page is swap cache and clean too.  So we only dirty page which is
added into swap cache in page reclaim, which shouldn't be swapin page.
As Minchan suggested, simply dirty the page in add_to_swap can do the
job.

Fixes: 802a3a92ad7a ("mm: reclaim MADV_FREE pages")
Link: http://lkml.kernel.org/r/08c84256b007bf3f63c91d94383bd9eb6fee2daa.1506446061.git.shli@fb.com
Signed-off-by: Shaohua Li <shli@fb.com>
Reported-by: Artem Savkov <asavkov@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: <stable@vger.kernel.org>	[4.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-03 17:54:25 -07:00
Huang Ying
d9bfcfdc41 mm, swap: add sysfs interface for VMA based swap readahead
The sysfs interface to control the VMA based swap readahead is added as
follow,

/sys/kernel/mm/swap/vma_ra_enabled

Enable the VMA based swap readahead algorithm, or use the original
global swap readahead algorithm.

/sys/kernel/mm/swap/vma_ra_max_order

Set the max order of the readahead window size for the VMA based swap
readahead algorithm.

The corresponding ABI documentation is added too.

Link: http://lkml.kernel.org/r/20170807054038.1843-5-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:29 -07:00
Huang Ying
ec560175c0 mm, swap: VMA based swap readahead
The swap readahead is an important mechanism to reduce the swap in
latency.  Although pure sequential memory access pattern isn't very
popular for anonymous memory, the space locality is still considered
valid.

In the original swap readahead implementation, the consecutive blocks in
swap device are readahead based on the global space locality estimation.
But the consecutive blocks in swap device just reflect the order of page
reclaiming, don't necessarily reflect the access pattern in virtual
memory.  And the different tasks in the system may have different access
patterns, which makes the global space locality estimation incorrect.

In this patch, when page fault occurs, the virtual pages near the fault
address will be readahead instead of the swap slots near the fault swap
slot in swap device.  This avoid to readahead the unrelated swap slots.
At the same time, the swap readahead is changed to work on per-VMA from
globally.  So that the different access patterns of the different VMAs
could be distinguished, and the different readahead policy could be
applied accordingly.  The original core readahead detection and scaling
algorithm is reused, because it is an effect algorithm to detect the
space locality.

The test and result is as follow,

Common test condition
=====================

Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM) Swap device:
NVMe disk

Micro-benchmark with combined access pattern
============================================

vm-scalability, sequential swap test case, 4 processes to eat 50G
virtual memory space, repeat the sequential memory writing until 300
seconds.  The first round writing will trigger swap out, the following
rounds will trigger sequential swap in and out.

At the same time, run vm-scalability random swap test case in
background, 8 processes to eat 30G virtual memory space, repeat the
random memory write until 300 seconds.  This will trigger random swap-in
in the background.

This is a combined workload with sequential and random memory accessing
at the same time.  The result (for sequential workload) is as follow,

			Base		Optimized
			----		---------
throughput		345413 KB/s	414029 KB/s (+19.9%)
latency.average		97.14 us	61.06 us (-37.1%)
latency.50th		2 us		1 us
latency.60th		2 us		1 us
latency.70th		98 us		2 us
latency.80th		160 us		2 us
latency.90th		260 us		217 us
latency.95th		346 us		369 us
latency.99th		1.34 ms		1.09 ms
ra_hit%			52.69%		99.98%

The original swap readahead algorithm is confused by the background
random access workload, so readahead hit rate is lower.  The VMA-base
readahead algorithm works much better.

Linpack
=======

The test memory size is bigger than RAM to trigger swapping.

			Base		Optimized
			----		---------
elapsed_time		393.49 s	329.88 s (-16.2%)
ra_hit%			86.21%		98.82%

The score of base and optimized kernel hasn't visible changes.  But the
elapsed time reduced and readahead hit rate improved, so the optimized
kernel runs better for startup and tear down stages.  And the absolute
value of readahead hit rate is high, shows that the space locality is
still valid in some practical workloads.

Link: http://lkml.kernel.org/r/20170807054038.1843-4-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:29 -07:00
Huang Ying
c4fa63092f mm, swap: fix swap readahead marking
In the original implementation, it is possible that the existing pages
in the swap cache (not newly readahead) could be marked as the readahead
pages.  This will cause the statistics of swap readahead be wrong and
influence the swap readahead algorithm too.

This is fixed via marking a page as the readahead page only if it is
newly allocated and read from the disk.

When testing with linpack, after the fixing the swap readahead hit rate
increased from ~66% to ~86%.

Link: http://lkml.kernel.org/r/20170807054038.1843-3-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:29 -07:00
Huang Ying
cbc65df240 mm, swap: add swap readahead hit statistics
Patch series "mm, swap: VMA based swap readahead", v4.

The swap readahead is an important mechanism to reduce the swap in
latency.  Although pure sequential memory access pattern isn't very
popular for anonymous memory, the space locality is still considered
valid.

In the original swap readahead implementation, the consecutive blocks in
swap device are readahead based on the global space locality estimation.
But the consecutive blocks in swap device just reflect the order of page
reclaiming, don't necessarily reflect the access pattern in virtual
memory space.  And the different tasks in the system may have different
access patterns, which makes the global space locality estimation
incorrect.

In this patchset, when page fault occurs, the virtual pages near the
fault address will be readahead instead of the swap slots near the fault
swap slot in swap device.  This avoid to readahead the unrelated swap
slots.  At the same time, the swap readahead is changed to work on
per-VMA from globally.  So that the different access patterns of the
different VMAs could be distinguished, and the different readahead
policy could be applied accordingly.  The original core readahead
detection and scaling algorithm is reused, because it is an effect
algorithm to detect the space locality.

In addition to the swap readahead changes, some new sysfs interface is
added to show the efficiency of the readahead algorithm and some other
swap statistics.

This new implementation will incur more small random read, on SSD, the
improved correctness of estimation and readahead target should beat the
potential increased overhead, this is also illustrated in the test
results below.  But on HDD, the overhead may beat the benefit, so the
original implementation will be used by default.

The test and result is as follow,

Common test condition
=====================

Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM)
Swap device: NVMe disk

Micro-benchmark with combined access pattern
============================================

vm-scalability, sequential swap test case, 4 processes to eat 50G
virtual memory space, repeat the sequential memory writing until 300
seconds.  The first round writing will trigger swap out, the following
rounds will trigger sequential swap in and out.

At the same time, run vm-scalability random swap test case in
background, 8 processes to eat 30G virtual memory space, repeat the
random memory write until 300 seconds.  This will trigger random swap-in
in the background.

This is a combined workload with sequential and random memory accessing
at the same time.  The result (for sequential workload) is as follow,

			Base		Optimized
			----		---------
throughput		345413 KB/s	414029 KB/s (+19.9%)
latency.average		97.14 us	61.06 us (-37.1%)
latency.50th		2 us		1 us
latency.60th		2 us		1 us
latency.70th		98 us		2 us
latency.80th		160 us		2 us
latency.90th		260 us		217 us
latency.95th		346 us		369 us
latency.99th		1.34 ms		1.09 ms
ra_hit%			52.69%		99.98%

The original swap readahead algorithm is confused by the background
random access workload, so readahead hit rate is lower.  The VMA-base
readahead algorithm works much better.

Linpack
=======

The test memory size is bigger than RAM to trigger swapping.

			Base		Optimized
			----		---------
elapsed_time		393.49 s	329.88 s (-16.2%)
ra_hit%			86.21%		98.82%

The score of base and optimized kernel hasn't visible changes.  But the
elapsed time reduced and readahead hit rate improved, so the optimized
kernel runs better for startup and tear down stages.  And the absolute
value of readahead hit rate is high, shows that the space locality is
still valid in some practical workloads.

This patch (of 5):

The statistics for total readahead pages and total readahead hits are
recorded and exported via the following sysfs interface.

/sys/kernel/mm/swap/ra_hits
/sys/kernel/mm/swap/ra_total

With them, the efficiency of the swap readahead could be measured, so
that the swap readahead algorithm and parameters could be tuned
accordingly.

[akpm@linux-foundation.org: don't display swap stats if CONFIG_SWAP=n]
Link: http://lkml.kernel.org/r/20170807054038.1843-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:29 -07:00
Shaohua Li
23955622ff swap: add block io poll in swapin path
For fast flash disk, async IO could introduce overhead because of
context switch.  block-mq now supports IO poll, which improves
performance and latency a lot.  swapin is a good place to use this
technique, because the task is waiting for the swapin page to continue
execution.

In my virtual machine, directly read 4k data from a NVMe with iopoll is
about 60% better than that without poll.  With iopoll support in swapin
patch, my microbenchmark (a task does random memory write) is about
10%~25% faster.  CPU utilization increases a lot though, 2x and even 3x
CPU utilization.  This will depend on disk speed.

While iopoll in swapin isn't intended for all usage cases, it's a win
for latency sensistive workloads with high speed swap disk.  block layer
has knob to control poll in runtime.  If poll isn't enabled in block
layer, there should be no noticeable change in swapin.

I got a chance to run the same test in a NVMe with DRAM as the media.
In simple fio IO test, blkpoll boosts 50% performance in single thread
test and ~20% in 8 threads test.  So this is the base line.  In above
swap test, blkpoll boosts ~27% performance in single thread test.
blkpoll uses 2x CPU time though.

If we enable hybid polling, the performance gain has very slight drop
but CPU time is only 50% worse than that without blkpoll.  Also we can
adjust parameter of hybid poll, with it, the CPU time penality is
reduced further.  In 8 threads test, blkpoll doesn't help though.  The
performance is similar to that without blkpoll, but cpu utilization is
similar too.  There is lock contention in swap path.  The cpu time
spending on blkpoll isn't high.  So overall, blkpoll swapin isn't worse
than that without it.

The swapin readahead might read several pages in in the same time and
form a big IO request.  Since the IO will take longer time, it doesn't
make sense to do poll, so the patch only does iopoll for single page
swapin.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/070c3c3e40b711e7b1390002c991e86a-b5408f0@7511894063d3764ff01ea8111f5a004d7dd700ed078797c204a24e620ddb965c
Signed-off-by: Shaohua Li <shli@fb.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:30 -07:00
Minchan Kim
0f0746589e mm, THP, swap: move anonymous THP split logic to vmscan
The add_to_swap aims to allocate swap_space(ie, swap slot and swapcache)
so if it fails due to lack of space in case of THP or something(hdd swap
but tries THP swapout) *caller* rather than add_to_swap itself should
split the THP page and retry it with base page which is more natural.

Link: http://lkml.kernel.org/r/20170515112522.32457-4-ying.huang@intel.com
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06 16:24:31 -07:00
Minchan Kim
75f6d6d29a mm, THP, swap: unify swap slot free functions to put_swap_page
Now, get_swap_page takes struct page and allocates swap space according
to page size(ie, normal or THP) so it would be more cleaner to introduce
put_swap_page which is a counter function of get_swap_page.  Then, it
calls right swap slot free function depending on page's size.

[ying.huang@intel.com: minor cleanup and fix]
Link: http://lkml.kernel.org/r/20170515112522.32457-3-ying.huang@intel.com
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06 16:24:31 -07:00
Huang Ying
38d8b4e6bd mm, THP, swap: delay splitting THP during swap out
Patch series "THP swap: Delay splitting THP during swapping out", v11.

This patchset is to optimize the performance of Transparent Huge Page
(THP) swap.

Recently, the performance of the storage devices improved so fast that
we cannot saturate the disk bandwidth with single logical CPU when do
page swap out even on a high-end server machine.  Because the
performance of the storage device improved faster than that of single
logical CPU.  And it seems that the trend will not change in the near
future.  On the other hand, the THP becomes more and more popular
because of increased memory size.  So it becomes necessary to optimize
THP swap performance.

The advantages of the THP swap support include:

 - Batch the swap operations for the THP to reduce lock
   acquiring/releasing, including allocating/freeing the swap space,
   adding/deleting to/from the swap cache, and writing/reading the swap
   space, etc. This will help improve the performance of the THP swap.

 - The THP swap space read/write will be 2M sequential IO. It is
   particularly helpful for the swap read, which are usually 4k random
   IO. This will improve the performance of the THP swap too.

 - It will help the memory fragmentation, especially when the THP is
   heavily used by the applications. The 2M continuous pages will be
   free up after THP swapping out.

 - It will improve the THP utilization on the system with the swap
   turned on. Because the speed for khugepaged to collapse the normal
   pages into the THP is quite slow. After the THP is split during the
   swapping out, it will take quite long time for the normal pages to
   collapse back into the THP after being swapped in. The high THP
   utilization helps the efficiency of the page based memory management
   too.

There are some concerns regarding THP swap in, mainly because possible
enlarged read/write IO size (for swap in/out) may put more overhead on
the storage device.  To deal with that, the THP swap in should be turned
on only when necessary.  For example, it can be selected via
"always/never/madvise" logic, to be turned on globally, turned off
globally, or turned on only for VMA with MADV_HUGEPAGE, etc.

This patchset is the first step for the THP swap support.  The plan is
to delay splitting THP step by step, finally avoid splitting THP during
the THP swapping out and swap out/in the THP as a whole.

As the first step, in this patchset, the splitting huge page is delayed
from almost the first step of swapping out to after allocating the swap
space for the THP and adding the THP into the swap cache.  This will
reduce lock acquiring/releasing for the locks used for the swap cache
management.

With the patchset, the swap out throughput improves 15.5% (from about
3.73GB/s to about 4.31GB/s) in the vm-scalability swap-w-seq test case
with 8 processes.  The test is done on a Xeon E5 v3 system.  The swap
device used is a RAM simulated PMEM (persistent memory) device.  To test
the sequential swapping out, the test case creates 8 processes, which
sequentially allocate and write to the anonymous pages until the RAM and
part of the swap device is used up.

This patch (of 5):

In this patch, splitting huge page is delayed from almost the first step
of swapping out to after allocating the swap space for the THP
(Transparent Huge Page) and adding the THP into the swap cache.  This
will batch the corresponding operation, thus improve THP swap out
throughput.

This is the first step for the THP swap optimization.  The plan is to
delay splitting the THP step by step and avoid splitting the THP
finally.

In this patch, one swap cluster is used to hold the contents of each THP
swapped out.  So, the size of the swap cluster is changed to that of the
THP (Transparent Huge Page) on x86_64 architecture (512).  For other
architectures which want such THP swap optimization,
ARCH_USES_THP_SWAP_CLUSTER needs to be selected in the Kconfig file for
the architecture.  In effect, this will enlarge swap cluster size by 2
times on x86_64.  Which may make it harder to find a free cluster when
the swap space becomes fragmented.  So that, this may reduce the
continuous swap space allocation and sequential write in theory.  The
performance test in 0day shows no regressions caused by this.

In the future of THP swap optimization, some information of the swapped
out THP (such as compound map count) will be recorded in the
swap_cluster_info data structure.

The mem cgroup swap accounting functions are enhanced to support charge
or uncharge a swap cluster backing a THP as a whole.

The swap cluster allocate/free functions are added to allocate/free a
swap cluster for a THP.  A fair simple algorithm is used for swap
cluster allocation, that is, only the first swap device in priority list
will be tried to allocate the swap cluster.  The function will fail if
the trying is not successful, and the caller will fallback to allocate a
single swap slot instead.  This works good enough for normal cases.  If
the difference of the number of the free swap clusters among multiple
swap devices is significant, it is possible that some THPs are split
earlier than necessary.  For example, this could be caused by big size
difference among multiple swap devices.

The swap cache functions is enhanced to support add/delete THP to/from
the swap cache as a set of (HPAGE_PMD_NR) sub-pages.  This may be
enhanced in the future with multi-order radix tree.  But because we will
split the THP soon during swapping out, that optimization doesn't make
much sense for this first step.

The THP splitting functions are enhanced to support to split THP in swap
cache during swapping out.  The page lock will be held during allocating
the swap cluster, adding the THP into the swap cache and splitting the
THP.  So in the code path other than swapping out, if the THP need to be
split, the PageSwapCache(THP) will be always false.

The swap cluster is only available for SSD, so the THP swap optimization
in this patchset has no effect for HDD.

[ying.huang@intel.com: fix two issues in THP optimize patch]
  Link: http://lkml.kernel.org/r/87k25ed8zo.fsf@yhuang-dev.intel.com
[hannes@cmpxchg.org: extensive cleanups and simplifications, reduce code size]
Link: http://lkml.kernel.org/r/20170515112522.32457-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Andrew Morton <akpm@linux-foundation.org> [for config option]
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> [for changes in huge_memory.c and huge_mm.h]
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-06 16:24:31 -07:00
Huang Ying
54f180d3c1 mm, swap: use kvzalloc to allocate some swap data structures
Now vzalloc() is used in swap code to allocate various data structures,
such as swap cache, swap slots cache, cluster info, etc.  Because the
size may be too large on some system, so that normal kzalloc() may fail.
But using kzalloc() has some advantages, for example, less memory
fragmentation, less TLB pressure, etc.  So change the data structure
allocation in swap code to use kvzalloc() which will try kzalloc()
firstly, and fallback to vzalloc() if kzalloc() failed.

In general, although kmalloc() will reduce the number of high-order
pages in short term, vmalloc() will cause more pain for memory
fragmentation in the long term.  And the swap data structure allocation
that is changed in this patch is expected to be long term allocation.

From Dave Hansen:
 "for example, we have a two-page data structure. vmalloc() takes two
  effectively random order-0 pages, probably from two different 2M pages
  and pins them. That "kills" two 2M pages. kmalloc(), allocating two
  *contiguous* pages, will not cross a 2M boundary. That means it will
  only "kill" the possibility of a single 2M page. More 2M pages == less
  fragmentation.

The allocation in this patch occurs during swap on time, which is
usually done during system boot, so usually we have high opportunity to
allocate the contiguous pages successfully.

The allocation for swap_map[] in struct swap_info_struct is not changed,
because that is usually quite large and vmalloc_to_page() is used for
it.  That makes it a little harder to change.

Link: http://lkml.kernel.org/r/20170407064911.25447-1-ying.huang@intel.com
Signed-off-by: Huang Ying <ying.huang@intel.com>
Acked-by: Tim Chen <tim.c.chen@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:13 -07:00
Huang Ying
9c1cc2e4f2 mm, swap: fix comment in __read_swap_cache_async
Commit cbab0e4eec29 ("swap: avoid read_swap_cache_async() race to
deadlock while waiting on discard I/O completion") fixed a deadlock in
read_swap_cache_async().  Because at that time, in swap allocation path,
a swap entry may be set as SWAP_HAS_CACHE, then wait for discarding to
complete before the page for the swap entry is added to the swap cache.

But in commit 815c2c543d3a ("swap: make swap discard async"), the
discarding for swap become asynchronous, waiting for discarding to
complete will be done before the swap entry is set as SWAP_HAS_CACHE.
So the comments in code is incorrect now.  This patch fixes the
comments.

The cond_resched() added in the commit cbab0e4eec29 is not necessary now
too.  But if we added some sleep in swap allocation path in the future,
there may be some hard to debug/reproduce deadlock bug.  So it is kept.

Link: http://lkml.kernel.org/r/20170317064635.12792-1-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03 15:52:10 -07:00
Huang Ying
ba81f83842 mm/swap: skip readahead only when swap slot cache is enabled
Because during swap off, a swap entry may have swap_map[] ==
SWAP_HAS_CACHE (for example, just allocated).  If we return NULL in
__read_swap_cache_async(), the swap off will abort.  So when swap slot
cache is disabled, (for swap off), we will wait for page to be put into
swap cache in such race condition.  This should not be a problem for swap
slot cache, because swap slot cache should be drained after clearing
swap_slot_cache_enabled.

[ying.huang@intel.com: fix memory leak in __read_swap_cache_async()]
  Link: http://lkml.kernel.org/r/874lzt6znd.fsf@yhuang-dev.intel.com
Link: http://lkml.kernel.org/r/5e2c5f6abe8e6eb0797408897b1bba80938e9b9d.1484082593.git.tim.c.chen@linux.intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net> escreveu:
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22 16:41:30 -08:00
Tim Chen
67afa38e01 mm/swap: add cache for swap slots allocation
We add per cpu caches for swap slots that can be allocated and freed
quickly without the need to touch the swap info lock.

Two separate caches are maintained for swap slots allocated and swap
slots returned.  This is to allow the swap slots to be returned to the
global pool in a batch so they will have a chance to be coaelesced with
other slots in a cluster.  We do not reuse the slots that are returned
right away, as it may increase fragmentation of the slots.

The swap allocation cache is protected by a mutex as we may sleep when
searching for empty slots in cache.  The swap free cache is protected by
a spin lock as we cannot sleep in the free path.

We refill the swap slots cache when we run out of slots, and we disable
the swap slots cache and drain the slots if the global number of slots
fall below a low watermark threshold.  We re-enable the cache agian when
the slots available are above a high watermark.

[ying.huang@intel.com: use raw_cpu_ptr over this_cpu_ptr for swap slots access]
[tim.c.chen@linux.intel.com: add comments on locks in swap_slots.h]
  Link: http://lkml.kernel.org/r/20170118180327.GA24225@linux.intel.com
Link: http://lkml.kernel.org/r/35de301a4eaa8daa2977de6e987f2c154385eb66.1484082593.git.tim.c.chen@linux.intel.com
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Michal Hocko <mhocko@suse.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net> escreveu:
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22 16:41:30 -08:00
Tim Chen
e8c26ab605 mm/swap: skip readahead for unreferenced swap slots
We can avoid needlessly allocating page for swap slots that are not used
by anyone.  No pages have to be read in for these slots.

Link: http://lkml.kernel.org/r/0784b3f20b9bd3aa5552219624cb78dc4ae710c9.1484082593.git.tim.c.chen@linux.intel.com
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net> escreveu:
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22 16:41:30 -08:00
Huang, Ying
4b3ef9daa4 mm/swap: split swap cache into 64MB trunks
The patch is to improve the scalability of the swap out/in via using
fine grained locks for the swap cache.  In current kernel, one address
space will be used for each swap device.  And in the common
configuration, the number of the swap device is very small (one is
typical).  This causes the heavy lock contention on the radix tree of
the address space if multiple tasks swap out/in concurrently.

But in fact, there is no dependency between pages in the swap cache.  So
that, we can split the one shared address space for each swap device
into several address spaces to reduce the lock contention.  In the
patch, the shared address space is split into 64MB trunks.  64MB is
chosen to balance the memory space usage and effect of lock contention
reduction.

The size of struct address_space on x86_64 architecture is 408B, so with
the patch, 6528B more memory will be used for every 1GB swap space on
x86_64 architecture.

One address space is still shared for the swap entries in the same 64M
trunks.  To avoid lock contention for the first round of swap space
allocation, the order of the swap clusters in the initial free clusters
list is changed.  The swap space distance between the consecutive swap
clusters in the free cluster list is at least 64M.  After the first
round of allocation, the swap clusters are expected to be freed
randomly, so the lock contention should be reduced effectively.

Link: http://lkml.kernel.org/r/735bab895e64c930581ffb0a05b661e01da82bc5.1484082593.git.tim.c.chen@linux.intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net> escreveu:
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-22 16:41:30 -08:00
Huang Ying
f6ab1f7f6b mm, swap: use offset of swap entry as key of swap cache
This patch is to improve the performance of swap cache operations when
the type of the swap device is not 0.  Originally, the whole swap entry
value is used as the key of the swap cache, even though there is one
radix tree for each swap device.  If the type of the swap device is not
0, the height of the radix tree of the swap cache will be increased
unnecessary, especially on 64bit architecture.  For example, for a 1GB
swap device on the x86_64 architecture, the height of the radix tree of
the swap cache is 11.  But if the offset of the swap entry is used as
the key of the swap cache, the height of the radix tree of the swap
cache is 4.  The increased height causes unnecessary radix tree
descending and increased cache footprint.

This patch reduces the height of the radix tree of the swap cache via
using the offset of the swap entry instead of the whole swap entry value
as the key of the swap cache.  In 32 processes sequential swap out test
case on a Xeon E5 v3 system with RAM disk as swap, the lock contention
for the spinlock of the swap cache is reduced from 20.15% to 12.19%,
when the type of the swap device is 1.

Use the whole swap entry as key,

  perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list: 10.37,
  perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg: 9.78,

Use the swap offset as key,

  perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list: 6.25,
  perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg: 5.94,

Link: http://lkml.kernel.org/r/1473270649-27229-1-git-send-email-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:28 -07:00
Aaron Lu
6fcb52a56f thp: reduce usage of huge zero page's atomic counter
The global zero page is used to satisfy an anonymous read fault.  If
THP(Transparent HugePage) is enabled then the global huge zero page is
used.  The global huge zero page uses an atomic counter for reference
counting and is allocated/freed dynamically according to its counter
value.

CPU time spent on that counter will greatly increase if there are a lot
of processes doing anonymous read faults.  This patch proposes a way to
reduce the access to the global counter so that the CPU load can be
reduced accordingly.

To do this, a new flag of the mm_struct is introduced:
MMF_USED_HUGE_ZERO_PAGE.  With this flag, the process only need to touch
the global counter in two cases:

 1 The first time it uses the global huge zero page;
 2 The time when mm_user of its mm_struct reaches zero.

Note that right now, the huge zero page is eligible to be freed as soon
as its last use goes away.  With this patch, the page will not be
eligible to be freed until the exit of the last process from which it
was ever used.

And with the use of mm_user, the kthread is not eligible to use huge
zero page either.  Since no kthread is using huge zero page today, there
is no difference after applying this patch.  But if that is not desired,
I can change it to when mm_count reaches zero.

Case used for test on Haswell EP:

  usemem -n 72 --readonly -j 0x200000 100G

Which spawns 72 processes and each will mmap 100G anonymous space and
then do read only access to that space sequentially with a step of 2MB.

  CPU cycles from perf report for base commit:
      54.03%  usemem   [kernel.kallsyms]   [k] get_huge_zero_page
  CPU cycles from perf report for this commit:
       0.11%  usemem   [kernel.kallsyms]   [k] mm_get_huge_zero_page

Performance(throughput) of the workload for base commit: 1784430792
Performance(throughput) of the workload for this commit: 4726928591
164% increase.

Runtime of the workload for base commit: 707592 us
Runtime of the workload for this commit: 303970 us
50% drop.

Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:28 -07:00
Huang Ying
371a096edf mm: don't use radix tree writeback tags for pages in swap cache
File pages use a set of radix tree tags (DIRTY, TOWRITE, WRITEBACK,
etc.) to accelerate finding the pages with a specific tag in the radix
tree during inode writeback.  But for anonymous pages in the swap cache,
there is no inode writeback.  So there is no need to find the pages with
some writeback tags in the radix tree.  It is not necessary to touch
radix tree writeback tags for pages in the swap cache.

Per Rik van Riel's suggestion, a new flag AS_NO_WRITEBACK_TAGS is
introduced for address spaces which don't need to update the writeback
tags.  The flag is set for swap caches.  It may be used for DAX file
systems, etc.

With this patch, the swap out bandwidth improved 22.3% (from ~1.2GB/s to
~1.48GBps) in the vm-scalability swap-w-seq test case with 8 processes.
The test is done on a Xeon E5 v3 system.  The swap device used is a RAM
simulated PMEM (persistent memory) device.  The improvement comes from
the reduced contention on the swap cache radix tree lock.  To test
sequential swapping out, the test case uses 8 processes, which
sequentially allocate and write to the anonymous pages until RAM and
part of the swap device is used up.

Details of comparison is as follow,

base             base+patch
---------------- --------------------------
         %stddev     %change         %stddev
             \          |                \
   2506952 ±  2%     +28.1%    3212076 ±  7%  vm-scalability.throughput
   1207402 ±  7%     +22.3%    1476578 ±  6%  vmstat.swap.so
     10.86 ± 12%     -23.4%       8.31 ± 16%  perf-profile.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list
     10.82 ± 13%     -33.1%       7.24 ± 14%  perf-profile.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_zone_memcg
     10.36 ± 11%    -100.0%       0.00 ± -1%  perf-profile.cycles-pp._raw_spin_lock_irqsave.__test_set_page_writeback.bdev_write_page.__swap_writepage.swap_writepage
     10.52 ± 12%    -100.0%       0.00 ± -1%  perf-profile.cycles-pp._raw_spin_lock_irqsave.test_clear_page_writeback.end_page_writeback.page_endio.pmem_rw_page

Link: http://lkml.kernel.org/r/1472578089-5560-1-git-send-email-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:28 -07:00
Mel Gorman
11fb998986 mm: move most file-based accounting to the node
There are now a number of accounting oddities such as mapped file pages
being accounted for on the node while the total number of file pages are
accounted on the zone.  This can be coped with to some extent but it's
confusing so this patch moves the relevant file-based accounted.  Due to
throttling logic in the page allocator for reliable OOM detection, it is
still necessary to track dirty and writeback pages on a per-zone basis.

[mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting]
  Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net
Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Gerald Schaefer
770a537022 mm: thp: broken page count after commit aa88b68c3b1d
Christian Borntraeger reported a kernel panic after corrupt page counts,
and it turned out to be a regression introduced with commit aa88b68c3b1d
("thp: keep huge zero page pinned until tlb flush"), at least on s390.

put_huge_zero_page() was moved over from zap_huge_pmd() to
release_pages(), and it was replaced by tlb_remove_page().  However,
release_pages() might not always be triggered by (the arch-specific)
tlb_remove_page().

On s390 we call free_page_and_swap_cache() from tlb_remove_page(), and
not tlb_flush_mmu() -> free_pages_and_swap_cache() like the generic
version, because we don't use the MMU-gather logic.  Although both
functions have very similar names, they are doing very unsimilar things,
in particular free_page_xxx is just doing a put_page(), while
free_pages_xxx calls release_pages().

This of course results in very harmful put_page()s on the huge zero
page, on architectures where tlb_remove_page() is implemented in this
way.  It seems to affect only s390 and sh, but sh doesn't have THP
support, so the problem (currently) probably only exists on s390.

The following quick hack fixed the issue:

Link: http://lkml.kernel.org/r/20160602172141.75c006a9@thinkpad
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: <stable@vger.kernel.org>	[4.6.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-09 14:23:11 -07:00
Hugh Dickins
fa9949da59 mm: use __SetPageSwapBacked and dont ClearPageSwapBacked
v3.16 commit 07a427884348 ("mm: shmem: avoid atomic operation during
shmem_getpage_gfp") rightly replaced one instance of SetPageSwapBacked
by __SetPageSwapBacked, pointing out that the newly allocated page is
not yet visible to other users (except speculative get_page_unless_zero-
ers, who may not update page flags before their further checks).

That was part of a series in which Mel was focused on tmpfs profiles:
but almost all SetPageSwapBacked uses can be so optimized, with the same
justification.

Remove ClearPageSwapBacked from __read_swap_cache_async() error path:
it's not an error to free a page with PG_swapbacked set.

Follow a convention of __SetPageLocked, __SetPageSwapBacked instead of
doing it differently in different places; but that's for tidiness - if
the ordering actually mattered, we should not be using the __variants.

There's probably scope for further __SetPageFlags in other places, but
SwapBacked is the one I'm interested in at the moment.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Yang Shi <yang.shi@linaro.org>
Cc: Ning Qu <quning@gmail.com>
Reviewed-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Kirill A. Shutemov
09cbfeaf1a mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized.  And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE.  And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special.  They are
not.

The changes are pretty straight-forward:

 - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

 - page_cache_get() -> get_page();

 - page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below.  For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach.  I'll
fix them manually in a separate patch.  Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Vladimir Davydov
37e8435119 mm: memcontrol: charge swap to cgroup2
This patchset introduces swap accounting to cgroup2.

This patch (of 7):

In the legacy hierarchy we charge memsw, which is dubious, because:

 - memsw.limit must be >= memory.limit, so it is impossible to limit
   swap usage less than memory usage. Taking into account the fact that
   the primary limiting mechanism in the unified hierarchy is
   memory.high while memory.limit is either left unset or set to a very
   large value, moving memsw.limit knob to the unified hierarchy would
   effectively make it impossible to limit swap usage according to the
   user preference.

 - memsw.usage != memory.usage + swap.usage, because a page occupying
   both swap entry and a swap cache page is charged only once to memsw
   counter. As a result, it is possible to effectively eat up to
   memory.limit of memory pages *and* memsw.limit of swap entries, which
   looks unexpected.

That said, we should provide a different swap limiting mechanism for
cgroup2.

This patch adds mem_cgroup->swap counter, which charges the actual number
of swap entries used by a cgroup.  It is only charged in the unified
hierarchy, while the legacy hierarchy memsw logic is left intact.

The swap usage can be monitored using new memory.swap.current file and
limited using memory.swap.max.

Note, to charge swap resource properly in the unified hierarchy, we have
to make swap_entry_free uncharge swap only when ->usage reaches zero, not
just ->count, i.e.  when all references to a swap entry, including the one
taken by swap cache, are gone.  This is necessary, because otherwise
swap-in could result in uncharging swap even if the page is still in swap
cache and hence still occupies a swap entry.  At the same time, this
shouldn't break memsw counter logic, where a page is never charged twice
for using both memory and swap, because in case of legacy hierarchy we
uncharge swap on commit (see mem_cgroup_commit_charge).

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20 17:09:18 -08:00
Minchan Kim
854e9ed09d mm: support madvise(MADV_FREE)
Linux doesn't have an ability to free pages lazy while other OS already
have been supported that named by madvise(MADV_FREE).

The gain is clear that kernel can discard freed pages rather than
swapping out or OOM if memory pressure happens.

Without memory pressure, freed pages would be reused by userspace
without another additional overhead(ex, page fault + allocation +
zeroing).

Jason Evans said:

: Facebook has been using MAP_UNINITIALIZED
: (https://lkml.org/lkml/2012/1/18/308) in some of its applications for
: several years, but there are operational costs to maintaining this
: out-of-tree in our kernel and in jemalloc, and we are anxious to retire it
: in favor of MADV_FREE.  When we first enabled MAP_UNINITIALIZED it
: increased throughput for much of our workload by ~5%, and although the
: benefit has decreased using newer hardware and kernels, there is still
: enough benefit that we cannot reasonably retire it without a replacement.
:
: Aside from Facebook operations, there are numerous broadly used
: applications that would benefit from MADV_FREE.  The ones that immediately
: come to mind are redis, varnish, and MariaDB.  I don't have much insight
: into Android internals and development process, but I would hope to see
: MADV_FREE support eventually end up there as well to benefit applications
: linked with the integrated jemalloc.
:
: jemalloc will use MADV_FREE once it becomes available in the Linux kernel.
: In fact, jemalloc already uses MADV_FREE or equivalent everywhere it's
: available: *BSD, OS X, Windows, and Solaris -- every platform except Linux
: (and AIX, but I'm not sure it even compiles on AIX).  The lack of
: MADV_FREE on Linux forced me down a long series of increasingly
: sophisticated heuristics for madvise() volume reduction, and even so this
: remains a common performance issue for people using jemalloc on Linux.
: Please integrate MADV_FREE; many people will benefit substantially.

How it works:

When madvise syscall is called, VM clears dirty bit of ptes of the
range.  If memory pressure happens, VM checks dirty bit of page table
and if it found still "clean", it means it's a "lazyfree pages" so VM
could discard the page instead of swapping out.  Once there was store
operation for the page before VM peek a page to reclaim, dirty bit is
set so VM can swap out the page instead of discarding.

One thing we should notice is that basically, MADV_FREE relies on dirty
bit in page table entry to decide whether VM allows to discard the page
or not.  IOW, if page table entry includes marked dirty bit, VM
shouldn't discard the page.

However, as a example, if swap-in by read fault happens, page table
entry doesn't have dirty bit so MADV_FREE could discard the page
wrongly.

For avoiding the problem, MADV_FREE did more checks with PageDirty and
PageSwapCache.  It worked out because swapped-in page lives on swap
cache and since it is evicted from the swap cache, the page has PG_dirty
flag.  So both page flags check effectively prevent wrong discarding by
MADV_FREE.

However, a problem in above logic is that swapped-in page has PG_dirty
still after they are removed from swap cache so VM cannot consider the
page as freeable any more even if madvise_free is called in future.

Look at below example for detail.

    ptr = malloc();
    memset(ptr);
    ..
    ..
    .. heavy memory pressure so all of pages are swapped out
    ..
    ..
    var = *ptr; -> a page swapped-in and could be removed from
                   swapcache. Then, page table doesn't mark
                   dirty bit and page descriptor includes PG_dirty
    ..
    ..
    madvise_free(ptr); -> It doesn't clear PG_dirty of the page.
    ..
    ..
    ..
    .. heavy memory pressure again.
    .. In this time, VM cannot discard the page because the page
    .. has *PG_dirty*

To solve the problem, this patch clears PG_dirty if only the page is
owned exclusively by current process when madvise is called because
PG_dirty represents ptes's dirtiness in several processes so we could
clear it only if we own it exclusively.

Firstly, heavy users would be general allocators(ex, jemalloc, tcmalloc
and hope glibc supports it) and jemalloc/tcmalloc already have supported
the feature for other OS(ex, FreeBSD)

  barrios@blaptop:~/benchmark/ebizzy$ lscpu
  Architecture:          x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Byte Order:            Little Endian
  CPU(s):                12
  On-line CPU(s) list:   0-11
  Thread(s) per core:    1
  Core(s) per socket:    1
  Socket(s):             12
  NUMA node(s):          1
  Vendor ID:             GenuineIntel
  CPU family:            6
  Model:                 2
  Stepping:              3
  CPU MHz:               3200.185
  BogoMIPS:              6400.53
  Virtualization:        VT-x
  Hypervisor vendor:     KVM
  Virtualization type:   full
  L1d cache:             32K
  L1i cache:             32K
  L2 cache:              4096K
  NUMA node0 CPU(s):     0-11
  ebizzy benchmark(./ebizzy -S 10 -n 512)

  Higher avg is better.

   vanilla-jemalloc             MADV_free-jemalloc

  1 thread
  records: 10                   records: 10
  avg:   2961.90                avg:  12069.70
  std:     71.96(2.43%)         std:    186.68(1.55%)
  max:   3070.00                max:  12385.00
  min:   2796.00                min:  11746.00

  2 thread
  records: 10                   records: 10
  avg:   5020.00                avg:  17827.00
  std:    264.87(5.28%)         std:    358.52(2.01%)
  max:   5244.00                max:  18760.00
  min:   4251.00                min:  17382.00

  4 thread
  records: 10                   records: 10
  avg:   8988.80                avg:  27930.80
  std:   1175.33(13.08%)        std:   3317.33(11.88%)
  max:   9508.00                max:  30879.00
  min:   5477.00                min:  21024.00

  8 thread
  records: 10                   records: 10
  avg:  13036.50                avg:  33739.40
  std:    170.67(1.31%)         std:   5146.22(15.25%)
  max:  13371.00                max:  40572.00
  min:  12785.00                min:  24088.00

  16 thread
  records: 10                   records: 10
  avg:  11092.40                avg:  31424.20
  std:    710.60(6.41%)         std:   3763.89(11.98%)
  max:  12446.00                max:  36635.00
  min:   9949.00                min:  25669.00

  32 thread
  records: 10                   records: 10
  avg:  11067.00                avg:  34495.80
  std:    971.06(8.77%)         std:   2721.36(7.89%)
  max:  12010.00                max:  38598.00
  min:   9002.00                min:  30636.00

In summary, MADV_FREE is about much faster than MADV_DONTNEED.

This patch (of 12):

Add core MADV_FREE implementation.

[akpm@linux-foundation.org: small cleanups]
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Mika Penttil <mika.penttila@nextfour.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jason Evans <je@fb.com>
Cc: Daniel Micay <danielmicay@gmail.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Shaohua Li <shli@kernel.org>
Cc: <yalin.wang2010@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: "Shaohua Li" <shli@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Gang <gang.chen.5i5j@gmail.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Helge Deller <deller@gmx.de>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Roland Dreier <roland@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Shaohua Li <shli@kernel.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Kirill A. Shutemov
48c935ad88 page-flags: define PG_locked behavior on compound pages
lock_page() must operate on the whole compound page.  It doesn't make
much sense to lock part of compound page.  Change code to use head
page's PG_locked, if tail page is passed.

This patch also gets rid of custom helper functions --
__set_page_locked() and __clear_page_locked().  They are replaced with
helpers generated by __SETPAGEFLAG/__CLEARPAGEFLAG.  Tail pages to these
helper would trigger VM_BUG_ON().

SLUB uses PG_locked as a bit spin locked.  IIUC, tail pages should never
appear there.  VM_BUG_ON() is added to make sure that this assumption is
correct.

[akpm@linux-foundation.org: fix fs/cifs/file.c]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Dmitry Safonov
5b999aadba mm: swap: zswap: maybe_preload & refactoring
zswap_get_swap_cache_page and read_swap_cache_async have pretty much the
same code with only significant difference in return value and usage of
swap_readpage.

I a helper __read_swap_cache_async() with the common code.  Behavior
change: now zswap_get_swap_cache_page will use radix_tree_maybe_preload
instead radix_tree_preload.  Looks like, this wasn't changed only by the
reason of code duplication.

Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Herrmann <dh.herrmann@gmail.com>
Cc: Seth Jennings <sjennings@variantweb.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 15:35:28 -07:00
Jason Low
4db0c3c298 mm: remove rest of ACCESS_ONCE() usages
We converted some of the usages of ACCESS_ONCE to READ_ONCE in the mm/
tree since it doesn't work reliably on non-scalar types.

This patch removes the rest of the usages of ACCESS_ONCE, and use the new
READ_ONCE API for the read accesses.  This makes things cleaner, instead
of using separate/multiple sets of APIs.

Signed-off-by: Jason Low <jason.low2@hp.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15 16:35:18 -07:00
Christoph Hellwig
b83ae6d421 fs: remove mapping->backing_dev_info
Now that we never use the backing_dev_info pointer in struct address_space
we can simply remove it and save 4 to 8 bytes in every inode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-20 14:03:05 -07:00
Christoph Hellwig
97b713ba3e fs: kill BDI_CAP_SWAP_BACKED
This bdi flag isn't too useful - we can determine that a vma is backed by
either swap or shmem trivially in the caller.

This also allows removing the backing_dev_info instaces for swap and shmem
in favor of noop_backing_dev_info.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-20 14:02:56 -07:00
Johannes Weiner
5d1ea48bdd mm: page_cgroup: rename file to mm/swap_cgroup.c
Now that the external page_cgroup data structure and its lookup is gone,
the only code remaining in there is swap slot accounting.

Rename it and move the conditional compilation into mm/Makefile.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Tejun Heo <tj@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:09 -08:00
Michal Hocko
aabfb57296 mm: memcontrol: do not kill uncharge batching in free_pages_and_swap_cache
free_pages_and_swap_cache limits release_pages to PAGEVEC_SIZE chunks.
This is not a big deal for the normal release path but it completely kills
memcg uncharge batching which reduces res_counter spin_lock contention.
Dave has noticed this with his page fault scalability test case on a large
machine when the lock was basically dominating on all CPUs:

    80.18%    80.18%  [kernel]               [k] _raw_spin_lock
                  |
                  --- _raw_spin_lock
                     |
                     |--66.59%-- res_counter_uncharge_until
                     |          res_counter_uncharge
                     |          uncharge_batch
                     |          uncharge_list
                     |          mem_cgroup_uncharge_list
                     |          release_pages
                     |          free_pages_and_swap_cache
                     |          tlb_flush_mmu_free
                     |          |
                     |          |--90.12%-- unmap_single_vma
                     |          |          unmap_vmas
                     |          |          unmap_region
                     |          |          do_munmap
                     |          |          vm_munmap
                     |          |          sys_munmap
                     |          |          system_call_fastpath
                     |          |          __GI___munmap
                     |          |
                     |           --9.88%-- tlb_flush_mmu
                     |                     tlb_finish_mmu
                     |                     unmap_region
                     |                     do_munmap
                     |                     vm_munmap
                     |                     sys_munmap
                     |                     system_call_fastpath
                     |                     __GI___munmap

In his case the load was running in the root memcg and that part has been
handled by reverting 05b843012335 ("mm: memcontrol: use root_mem_cgroup
res_counter") because this is a clear regression, but the problem remains
inside dedicated memcgs.

There is no reason to limit release_pages to PAGEVEC_SIZE batches other
than lru_lock held times.  This logic, however, can be moved inside the
function.  mem_cgroup_uncharge_list and free_hot_cold_page_list do not
hold any lock for the whole pages_to_free list so it is safe to call them
in a single run.

The release_pages() code was previously breaking the lru_lock each
PAGEVEC_SIZE pages (ie, 14 pages).  However this code has no usage of
pagevecs so switch to breaking the lock at least every SWAP_CLUSTER_MAX
(32) pages.  This means that the lock acquisition frequency is
approximately halved and the max hold times are approximately doubled.

The now unneeded batching is removed from free_pages_and_swap_cache().

Also update the grossly out-of-date release_pages documentation.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Dave Hansen <dave@sr71.net>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:59 -04:00
Andrew Morton
1c93923cc2 include/linux/migrate.h: remove migrate_page #define
This is designed to avoid a few ifdefs in .c files but it's obnoxious
because it can cause unsuspecting "migrate_page" symbols to get turned into
"NULL".

Just nuke it and use the ifdefs.

Cc: Konstantin Khlebnikov <k.khlebnikov@samsung.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:56 -04:00
David Herrmann
4bb5f5d939 mm: allow drivers to prevent new writable mappings
This patch (of 6):

The i_mmap_writable field counts existing writable mappings of an
address_space.  To allow drivers to prevent new writable mappings, make
this counter signed and prevent new writable mappings if it is negative.
This is modelled after i_writecount and DENYWRITE.

This will be required by the shmem-sealing infrastructure to prevent any
new writable mappings after the WRITE seal has been set.  In case there
exists a writable mapping, this operation will fail with EBUSY.

Note that we rely on the fact that iff you already own a writable mapping,
you can increase the counter without using the helpers.  This is the same
that we do for i_writecount.

Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Ryan Lortie <desrt@desrt.ca>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Daniel Mack <zonque@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:31 -07:00
Johannes Weiner
0a31bc97c8 mm: memcontrol: rewrite uncharge API
The memcg uncharging code that is involved towards the end of a page's
lifetime - truncation, reclaim, swapout, migration - is impressively
complicated and fragile.

Because anonymous and file pages were always charged before they had their
page->mapping established, uncharges had to happen when the page type
could still be known from the context; as in unmap for anonymous, page
cache removal for file and shmem pages, and swap cache truncation for swap
pages.  However, these operations happen well before the page is actually
freed, and so a lot of synchronization is necessary:

- Charging, uncharging, page migration, and charge migration all need
  to take a per-page bit spinlock as they could race with uncharging.

- Swap cache truncation happens during both swap-in and swap-out, and
  possibly repeatedly before the page is actually freed.  This means
  that the memcg swapout code is called from many contexts that make
  no sense and it has to figure out the direction from page state to
  make sure memory and memory+swap are always correctly charged.

- On page migration, the old page might be unmapped but then reused,
  so memcg code has to prevent untimely uncharging in that case.
  Because this code - which should be a simple charge transfer - is so
  special-cased, it is not reusable for replace_page_cache().

But now that charged pages always have a page->mapping, introduce
mem_cgroup_uncharge(), which is called after the final put_page(), when we
know for sure that nobody is looking at the page anymore.

For page migration, introduce mem_cgroup_migrate(), which is called after
the migration is successful and the new page is fully rmapped.  Because
the old page is no longer uncharged after migration, prevent double
charges by decoupling the page's memcg association (PCG_USED and
pc->mem_cgroup) from the page holding an actual charge.  The new bits
PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
to the new page during migration.

mem_cgroup_migrate() is suitable for replace_page_cache() as well,
which gets rid of mem_cgroup_replace_page_cache().  However, care
needs to be taken because both the source and the target page can
already be charged and on the LRU when fuse is splicing: grab the page
lock on the charge moving side to prevent changing pc->mem_cgroup of a
page under migration.  Also, the lruvecs of both pages change as we
uncharge the old and charge the new during migration, and putback may
race with us, so grab the lru lock and isolate the pages iff on LRU to
prevent races and ensure the pages are on the right lruvec afterward.

Swap accounting is massively simplified: because the page is no longer
uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
before the final put_page() in page reclaim.

Finally, page_cgroup changes are now protected by whatever protection the
page itself offers: anonymous pages are charged under the page table lock,
whereas page cache insertions, swapin, and migration hold the page lock.
Uncharging happens under full exclusion with no outstanding references.
Charging and uncharging also ensure that the page is off-LRU, which
serializes against charge migration.  Remove the very costly page_cgroup
lock and set pc->flags non-atomically.

[mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable]
[vdavydov@parallels.com: fix flags definition]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Tested-by: Jet Chen <jet.chen@intel.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Tested-by: Felipe Balbi <balbi@ti.com>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:17 -07:00
Mel Gorman
b745bc85f2 mm: page_alloc: convert hot/cold parameter and immediate callers to bool
cold is a bool, make it one.  Make the likely case the "if" part of the
block instead of the else as according to the optimisation manual this is
preferred.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04 16:54:09 -07:00
Shaohua Li
579f82901f swap: add a simple detector for inappropriate swapin readahead
This is a patch to improve swap readahead algorithm.  It's from Hugh and
I slightly changed it.

Hugh's original changelog:

swapin readahead does a blind readahead, whether or not the swapin is
sequential.  This may be ok on harddisk, because large reads have
relatively small costs, and if the readahead pages are unneeded they can
be reclaimed easily - though, what if their allocation forced reclaim of
useful pages? But on SSD devices large reads are more expensive than
small ones: if the readahead pages are unneeded, reading them in caused
significant overhead.

This patch adds very simplistic random read detection.  Stealing the
PageReadahead technique from Konstantin Khlebnikov's patch, avoiding the
vma/anon_vma sophistications of Shaohua Li's patch, swapin_nr_pages()
simply looks at readahead's current success rate, and narrows or widens
its readahead window accordingly.  There is little science to its
heuristic: it's about as stupid as can be whilst remaining effective.

The table below shows elapsed times (in centiseconds) when running a
single repetitive swapping load across a 1000MB mapping in 900MB ram
with 1GB swap (the harddisk tests had taken painfully too long when I
used mem=500M, but SSD shows similar results for that).

Vanilla is the 3.6-rc7 kernel on which I started; Shaohua denotes his
Sep 3 patch in mmotm and linux-next; HughOld denotes my Oct 1 patch
which Shaohua showed to be defective; HughNew this Nov 14 patch, with
page_cluster as usual at default of 3 (8-page reads); HughPC4 this same
patch with page_cluster 4 (16-page reads); HughPC0 with page_cluster 0
(1-page reads: no readahead).

HDD for swapping to harddisk, SSD for swapping to VertexII SSD.  Seq for
sequential access to the mapping, cycling five times around; Rand for
the same number of random touches.  Anon for a MAP_PRIVATE anon mapping;
Shmem for a MAP_SHARED anon mapping, equivalent to tmpfs.

One weakness of Shaohua's vma/anon_vma approach was that it did not
optimize Shmem: seen below.  Konstantin's approach was perhaps mistuned,
50% slower on Seq: did not compete and is not shown below.

HDD        Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
Seq Anon     73921   76210   75611   76904   78191  121542
Seq Shmem    73601   73176   73855   72947   74543  118322
Rand Anon   895392  831243  871569  845197  846496  841680
Rand Shmem 1058375 1053486  827935  764955  764376  756489

SSD        Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
Seq Anon     24634   24198   24673   25107   21614   70018
Seq Shmem    24959   24932   25052   25703   22030   69678
Rand Anon    43014   26146   28075   25989   26935   25901
Rand Shmem   45349   45215   28249   24268   24138   24332

These tests are, of course, two extremes of a very simple case: under
heavier mixed loads I've not yet observed any consistent improvement or
degradation, and wider testing would be welcome.

Shaohua Li:

Test shows Vanilla is slightly better in sequential workload than Hugh's
patch.  I observed with Hugh's patch sometimes the readahead size is
shrinked too fast (from 8 to 1 immediately) in sequential workload if
there is no hit.  And in such case, continuing doing readahead is good
actually.

I don't prepare a sophisticated algorithm for the sequential workload
because so far we can't guarantee sequential accessed pages are swap out
sequentially.  So I slightly change Hugh's heuristic - don't shrink
readahead size too fast.

Here is my test result (unit second, 3 runs average):
	Vanilla		Hugh		New
Seq	356		370		360
Random	4525		2447		2444

Attached graph is the swapin/swapout throughput I collected with 'vmstat
2'.  The first part is running a random workload (till around 1200 of
the x-axis) and the second part is running a sequential workload.
swapin and swapout throughput are almost identical in steady state in
both workloads.  These are expected behavior.  while in Vanilla, swapin
is much bigger than swapout especially in random workload (because wrong
readahead).

Original patches by: Shaohua Li and Konstantin Khlebnikov.

[fengguang.wu@intel.com: swapin_nr_pages() can be static]
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-06 13:48:51 -08:00