106 Commits

Author SHA1 Message Date
John Galt
6cc0485823
mm/util: bump overcommit_ratio
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-07-31 09:24:50 +00:00
Eric Biggers
d73f58abbf Merge 4.14.285 into android-4.14-stable
Changes in 4.14.285
	9p: missing chunk of "fs/9p: Don't update file type when updating file attributes"
	crypto: chacha20 - Fix keystream alignment for chacha20_block()
	random: always fill buffer in get_random_bytes_wait
	random: optimize add_interrupt_randomness
	drivers/char/random.c: remove unused dont_count_entropy
	random: Fix whitespace pre random-bytes work
	random: Return nbytes filled from hw RNG
	random: add a config option to trust the CPU's hwrng
	random: remove preempt disabled region
	random: Make crng state queryable
	random: make CPU trust a boot parameter
	drivers/char/random.c: constify poolinfo_table
	drivers/char/random.c: remove unused stuct poolinfo::poolbits
	drivers/char/random.c: make primary_crng static
	random: only read from /dev/random after its pool has received 128 bits
	random: move rand_initialize() earlier
	random: document get_random_int() family
	latent_entropy: avoid build error when plugin cflags are not set
	random: fix soft lockup when trying to read from an uninitialized blocking pool
	random: Support freezable kthreads in add_hwgenerator_randomness()
	fdt: add support for rng-seed
	random: Use wait_event_freezable() in add_hwgenerator_randomness()
	char/random: Add a newline at the end of the file
	Revert "hwrng: core - Freeze khwrng thread during suspend"
	crypto: Deduplicate le32_to_cpu_array() and cpu_to_le32_array()
	crypto: blake2s - generic C library implementation and selftest
	lib/crypto: blake2s: move hmac construction into wireguard
	lib/crypto: sha1: re-roll loops to reduce code size
	random: Don't wake crng_init_wait when crng_init == 1
	random: Add a urandom_read_nowait() for random APIs that don't warn
	random: add GRND_INSECURE to return best-effort non-cryptographic bytes
	random: ignore GRND_RANDOM in getentropy(2)
	random: make /dev/random be almost like /dev/urandom
	char/random: silence a lockdep splat with printk()
	random: fix crash on multiple early calls to add_bootloader_randomness()
	random: remove the blocking pool
	random: delete code to pull data into pools
	random: remove kernel.random.read_wakeup_threshold
	random: remove unnecessary unlikely()
	random: convert to ENTROPY_BITS for better code readability
	random: Add and use pr_fmt()
	random: fix typo in add_timer_randomness()
	random: remove some dead code of poolinfo
	random: split primary/secondary crng init paths
	random: avoid warnings for !CONFIG_NUMA builds
	x86: Remove arch_has_random, arch_has_random_seed
	powerpc: Remove arch_has_random, arch_has_random_seed
	s390: Remove arch_has_random, arch_has_random_seed
	linux/random.h: Remove arch_has_random, arch_has_random_seed
	linux/random.h: Use false with bool
	linux/random.h: Mark CONFIG_ARCH_RANDOM functions __must_check
	powerpc: Use bool in archrandom.h
	random: add arch_get_random_*long_early()
	random: avoid arch_get_random_seed_long() when collecting IRQ randomness
	random: remove dead code left over from blocking pool
	MAINTAINERS: co-maintain random.c
	crypto: blake2s - include <linux/bug.h> instead of <asm/bug.h>
	crypto: blake2s - adjust include guard naming
	random: document add_hwgenerator_randomness() with other input functions
	random: remove unused irq_flags argument from add_interrupt_randomness()
	random: use BLAKE2s instead of SHA1 in extraction
	random: do not sign extend bytes for rotation when mixing
	random: do not re-init if crng_reseed completes before primary init
	random: mix bootloader randomness into pool
	random: harmonize "crng init done" messages
	random: use IS_ENABLED(CONFIG_NUMA) instead of ifdefs
	random: initialize ChaCha20 constants with correct endianness
	random: early initialization of ChaCha constants
	random: avoid superfluous call to RDRAND in CRNG extraction
	random: don't reset crng_init_cnt on urandom_read()
	random: fix typo in comments
	random: cleanup poolinfo abstraction
	crypto: chacha20 - Fix chacha20_block() keystream alignment (again)
	random: cleanup integer types
	random: remove incomplete last_data logic
	random: remove unused extract_entropy() reserved argument
	random: try to actively add entropy rather than passively wait for it
	random: rather than entropy_store abstraction, use global
	random: remove unused OUTPUT_POOL constants
	random: de-duplicate INPUT_POOL constants
	random: prepend remaining pool constants with POOL_
	random: cleanup fractional entropy shift constants
	random: access input_pool_data directly rather than through pointer
	random: simplify arithmetic function flow in account()
	random: continually use hwgenerator randomness
	random: access primary_pool directly rather than through pointer
	random: only call crng_finalize_init() for primary_crng
	random: use computational hash for entropy extraction
	random: simplify entropy debiting
	random: use linear min-entropy accumulation crediting
	random: always wake up entropy writers after extraction
	random: make credit_entropy_bits() always safe
	random: remove use_input_pool parameter from crng_reseed()
	random: remove batched entropy locking
	random: fix locking in crng_fast_load()
	random: use RDSEED instead of RDRAND in entropy extraction
	random: inline leaves of rand_initialize()
	random: ensure early RDSEED goes through mixer on init
	random: do not xor RDRAND when writing into /dev/random
	random: absorb fast pool into input pool after fast load
	random: use hash function for crng_slow_load()
	random: remove outdated INT_MAX >> 6 check in urandom_read()
	random: zero buffer after reading entropy from userspace
	random: tie batched entropy generation to base_crng generation
	random: remove ifdef'd out interrupt bench
	random: remove unused tracepoints
	random: add proper SPDX header
	random: deobfuscate irq u32/u64 contributions
	random: introduce drain_entropy() helper to declutter crng_reseed()
	random: remove useless header comment
	random: remove whitespace and reorder includes
	random: group initialization wait functions
	random: group entropy extraction functions
	random: group entropy collection functions
	random: group userspace read/write functions
	random: group sysctl functions
	random: rewrite header introductory comment
	random: defer fast pool mixing to worker
	random: do not take pool spinlock at boot
	random: unify early init crng load accounting
	random: check for crng_init == 0 in add_device_randomness()
	random: pull add_hwgenerator_randomness() declaration into random.h
	random: clear fast pool, crng, and batches in cpuhp bring up
	random: round-robin registers as ulong, not u32
	random: only wake up writers after zap if threshold was passed
	random: cleanup UUID handling
	random: unify cycles_t and jiffies usage and types
	random: do crng pre-init loading in worker rather than irq
	random: give sysctl_random_min_urandom_seed a more sensible value
	random: don't let 644 read-only sysctls be written to
	random: replace custom notifier chain with standard one
	random: use SipHash as interrupt entropy accumulator
	random: make consistent usage of crng_ready()
	random: reseed more often immediately after booting
	random: check for signal and try earlier when generating entropy
	random: skip fast_init if hwrng provides large chunk of entropy
	random: treat bootloader trust toggle the same way as cpu trust toggle
	random: re-add removed comment about get_random_{u32,u64} reseeding
	random: mix build-time latent entropy into pool at init
	random: do not split fast init input in add_hwgenerator_randomness()
	random: do not allow user to keep crng key around on stack
	random: check for signal_pending() outside of need_resched() check
	random: check for signals every PAGE_SIZE chunk of /dev/[u]random
	random: make random_get_entropy() return an unsigned long
	random: document crng_fast_key_erasure() destination possibility
	random: fix sysctl documentation nits
	init: call time_init() before rand_initialize()
	ia64: define get_cycles macro for arch-override
	s390: define get_cycles macro for arch-override
	parisc: define get_cycles macro for arch-override
	alpha: define get_cycles macro for arch-override
	powerpc: define get_cycles macro for arch-override
	timekeeping: Add raw clock fallback for random_get_entropy()
	m68k: use fallback for random_get_entropy() instead of zero
	mips: use fallback for random_get_entropy() instead of just c0 random
	arm: use fallback for random_get_entropy() instead of zero
	nios2: use fallback for random_get_entropy() instead of zero
	x86/tsc: Use fallback for random_get_entropy() instead of zero
	um: use fallback for random_get_entropy() instead of zero
	sparc: use fallback for random_get_entropy() instead of zero
	xtensa: use fallback for random_get_entropy() instead of zero
	random: insist on random_get_entropy() existing in order to simplify
	random: do not use batches when !crng_ready()
	random: do not pretend to handle premature next security model
	random: order timer entropy functions below interrupt functions
	random: do not use input pool from hard IRQs
	random: help compiler out with fast_mix() by using simpler arguments
	siphash: use one source of truth for siphash permutations
	random: use symbolic constants for crng_init states
	random: avoid initializing twice in credit race
	random: remove ratelimiting for in-kernel unseeded randomness
	random: use proper jiffies comparison macro
	random: handle latent entropy and command line from random_init()
	random: credit architectural init the exact amount
	random: use static branch for crng_ready()
	random: remove extern from functions in header
	random: use proper return types on get_random_{int,long}_wait()
	random: move initialization functions out of hot pages
	random: move randomize_page() into mm where it belongs
	random: convert to using fops->write_iter()
	random: wire up fops->splice_{read,write}_iter()
	random: check for signals after page of pool writes
	Revert "random: use static branch for crng_ready()"
	crypto: drbg - add FIPS 140-2 CTRNG for noise source
	crypto: drbg - always seeded with SP800-90B compliant noise source
	crypto: drbg - prepare for more fine-grained tracking of seeding state
	crypto: drbg - track whether DRBG was seeded with !rng_is_initialized()
	crypto: drbg - move dynamic ->reseed_threshold adjustments to __drbg_seed()
	crypto: drbg - always try to free Jitter RNG instance
	crypto: drbg - make reseeding from get_random_bytes() synchronous
	random: avoid checking crng_ready() twice in random_init()
	random: mark bootloader randomness code as __init
	random: account for arch randomness in bits
	ASoC: cs42l52: Fix TLV scales for mixer controls
	ASoC: cs53l30: Correct number of volume levels on SX controls
	ASoC: cs42l52: Correct TLV for Bypass Volume
	ASoC: cs42l56: Correct typo in minimum level for SX volume controls
	ata: libata-core: fix NULL pointer deref in ata_host_alloc_pinfo()
	ASoC: wm8962: Fix suspend while playing music
	scsi: vmw_pvscsi: Expand vcpuHint to 16 bits
	scsi: lpfc: Fix port stuck in bypassed state after LIP in PT2PT topology
	scsi: ipr: Fix missing/incorrect resource cleanup in error case
	scsi: pmcraid: Fix missing resource cleanup in error case
	virtio-mmio: fix missing put_device() when vm_cmdline_parent registration failed
	nfc: nfcmrvl: Fix memory leak in nfcmrvl_play_deferred
	ipv6: Fix signed integer overflow in l2tp_ip6_sendmsg
	net: ethernet: mtk_eth_soc: fix misuse of mem alloc interface netdev[napi]_alloc_frag
	random: credit cpu and bootloader seeds by default
	pNFS: Don't keep retrying if the server replied NFS4ERR_LAYOUTUNAVAILABLE
	i40e: Fix call trace in setup_tx_descriptors
	tty: goldfish: Fix free_irq() on remove
	misc: atmel-ssc: Fix IRQ check in ssc_probe
	net: bgmac: Fix an erroneous kfree() in bgmac_remove()
	arm64: ftrace: fix branch range checks
	certs/blacklist_hashes.c: fix const confusion in certs blacklist
	irqchip/gic/realview: Fix refcount leak in realview_gic_of_init
	comedi: vmk80xx: fix expression for tx buffer size
	USB: serial: option: add support for Cinterion MV31 with new baseline
	USB: serial: io_ti: add Agilent E5805A support
	usb: dwc2: Fix memory leak in dwc2_hcd_init
	usb: gadget: lpc32xx_udc: Fix refcount leak in lpc32xx_udc_probe
	serial: 8250: Store to lsr_save_flags after lsr read
	ext4: fix bug_on ext4_mb_use_inode_pa
	ext4: make variable "count" signed
	ext4: add reserved GDT blocks check
	virtio-pci: Remove wrong address verification in vp_del_vqs()
	l2tp: don't use inet_shutdown on ppp session destroy
	l2tp: fix race in pppol2tp_release with session object destroy
	s390/mm: use non-quiescing sske for KVM switch to keyed guest
	usb: gadget: u_ether: fix regression in setting fixed MAC address
	xprtrdma: fix incorrect header size calculations
	tcp: add some entropy in __inet_hash_connect()
	tcp: use different parts of the port_offset for index and offset
	tcp: add small random increments to the source port
	tcp: dynamically allocate the perturb table used by source ports
	tcp: increase source port perturb table to 2^16
	tcp: drop the hash_32() part from the index calculation
	Linux 4.14.285

Conflicts:
	crypto/chacha20_generic.c
	drivers/char/random.c
	drivers/of/fdt.c
	include/crypto/chacha20.h
	lib/chacha20.c

Merge resolution notes:
  - Added CHACHA20_KEY_SIZE and CHACHA20_BLOCK_SIZE constants to
    chacha.h, to minimize changes from the 4.14.285 version of random.c

  - Updated lib/vsprintf.c for
    "random: replace custom notifier chain with standard one".

Change-Id: I6a4ca9b12ed23f76bac6c4c9e6306e2b354e2752
Signed-off-by: Eric Biggers <ebiggers@google.com>
2022-06-28 18:00:02 +00:00
Jason A. Donenfeld
225c0df138 random: move randomize_page() into mm where it belongs
commit 5ad7dd882e45d7fe432c32e896e2aaa0b21746ea upstream.

randomize_page is an mm function. It is documented like one. It contains
the history of one. It has the naming convention of one. It looks
just like another very similar function in mm, randomize_stack_top().
And it has always been maintained and updated by mm people. There is no
need for it to be in random.c. In the "which shape does not look like
the other ones" test, pointing to randomize_page() is correct.

So move randomize_page() into mm/util.c, right next to the similar
randomize_stack_top() function.

This commit contains no actual code changes.

Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-25 11:46:40 +02:00
Greg Kroah-Hartman
d6d7b18f40 This is the 4.14.185 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl7tx/EACgkQONu9yGCS
 aT4/ABAApnoLXN+Drzomi5DYYQoVyf+4sTgIHSks4Q7S0TggPYLH/UcYAkqxgmjd
 Lj3VKcCSZqoR6gCeTLK34DSxlKezvxKI/u5LVjdRVdyWc4W2y3InqYBGikPeuGfW
 Ud4E2pGe/NgoZ0jf6dxkIxQx3DqtrsY0742MTCG3BNEYo4B4HMcN0LEUEFtwSjqj
 e1LpE8sCX2MhfvAzCvajBhNlIv4Sdgr47+52yhxjS8h04D6rq92jGsdjUjw5Oqtt
 wPIf2v1zxFq3QkPf3pYW5e/FsfzzDk3pIs1bQ4TnoJkAGfI52eBTjekdRSU0TOiG
 U7RBXsVFCEcB/ec5WkiNzUhK4+SpEuO3SDa6u7OVlQ2wGTT/c1k1prJL0AdXwUJn
 NYqOe1WZpXORajLYbVNAQwvbTN3Chho5FI9w+WoU/WvmzxelqXAUCSTsqEhNV8oS
 ZBWu6anawL8C9d0aYsq1DeV2CMEPmQPWp9HeiNtZgl0ZKkZlqBcxW2dccq9h8PH4
 xqRImk+owbKsT0NV3dV150Q41nc2rTQd2jyGeoA+iek17/XDxWT9ICArA+YfBhx1
 uf+dNPyl8g8ATvTZaS7su/Q7T2rsgBc64DO5R+jtTp4QRgl/N2b4QyewfjusrRXA
 x3Ckq0oA9ZOcfLs9IAoTD5L1mmwSaYUa6fhiZHhop2daUPUbAgE=
 =BJUs
 -----END PGP SIGNATURE-----

Merge 4.14.185 into android-4.14-stable

Changes in 4.14.185
	ipv6: fix IPV6_ADDRFORM operation logic
	vxlan: Avoid infinite loop when suppressing NS messages with invalid options
	make 'user_access_begin()' do 'access_ok()'
	Fix 'acccess_ok()' on alpha and SH
	arch/openrisc: Fix issues with access_ok()
	x86: uaccess: Inhibit speculation past access_ok() in user_access_begin()
	lib: Reduce user_access_begin() boundaries in strncpy_from_user() and strnlen_user()
	serial: imx: Fix handling of TC irq in combination with DMA
	crypto: talitos - fix ECB and CBC algs ivsize
	ARM: 8977/1: ptrace: Fix mask for thumb breakpoint hook
	sched/fair: Don't NUMA balance for kthreads
	Input: synaptics - add a second working PNP_ID for Lenovo T470s
	drivers/net/ibmvnic: Update VNIC protocol version reporting
	powerpc/xive: Clear the page tables for the ESB IO mapping
	ath9k_htc: Silence undersized packet warnings
	perf probe: Accept the instance number of kretprobe event
	mm: add kvfree_sensitive() for freeing sensitive data objects
	x86_64: Fix jiffies ODR violation
	x86/PCI: Mark Intel C620 MROMs as having non-compliant BARs
	x86/speculation: Prevent rogue cross-process SSBD shutdown
	x86/reboot/quirks: Add MacBook6,1 reboot quirk
	efi/efivars: Add missing kobject_put() in sysfs entry creation error path
	ALSA: es1688: Add the missed snd_card_free()
	ALSA: hda/realtek - add a pintbl quirk for several Lenovo machines
	ALSA: usb-audio: Fix inconsistent card PM state after resume
	ACPI: sysfs: Fix reference count leak in acpi_sysfs_add_hotplug_profile()
	ACPI: CPPC: Fix reference count leak in acpi_cppc_processor_probe()
	ACPI: GED: add support for _Exx / _Lxx handler methods
	ACPI: PM: Avoid using power resources if there are none for D0
	cgroup, blkcg: Prepare some symbols for module and !CONFIG_CGROUP usages
	nilfs2: fix null pointer dereference at nilfs_segctor_do_construct()
	spi: bcm2835aux: Fix controller unregister order
	spi: bcm-qspi: when tx/rx buffer is NULL set to 0
	crypto: cavium/nitrox - Fix 'nitrox_get_first_device()' when ndevlist is fully iterated
	ALSA: pcm: disallow linking stream to itself
	kvm: x86: Fix L1TF mitigation for shadow MMU
	KVM: x86/mmu: Consolidate "is MMIO SPTE" code
	KVM: x86: only do L1TF workaround on affected processors
	x86/speculation: Change misspelled STIPB to STIBP
	x86/speculation: Add support for STIBP always-on preferred mode
	x86/speculation: Avoid force-disabling IBPB based on STIBP and enhanced IBRS.
	x86/speculation: PR_SPEC_FORCE_DISABLE enforcement for indirect branches.
	spi: dw: fix possible race condition
	spi: dw: Fix controller unregister order
	spi: No need to assign dummy value in spi_unregister_controller()
	spi: Fix controller unregister order
	spi: pxa2xx: Fix controller unregister order
	spi: bcm2835: Fix controller unregister order
	crypto: virtio: Fix use-after-free in virtio_crypto_skcipher_finalize_req()
	crypto: virtio: Fix src/dst scatterlist calculation in __virtio_crypto_skcipher_do_req()
	crypto: virtio: Fix dest length calculation in __virtio_crypto_skcipher_do_req()
	selftests/net: in rxtimestamp getopt_long needs terminating null entry
	ovl: initialize error in ovl_copy_xattr
	proc: Use new_inode not new_inode_pseudo
	video: fbdev: w100fb: Fix a potential double free.
	KVM: nSVM: fix condition for filtering async PF
	KVM: nSVM: leave ASID aside in copy_vmcb_control_area
	KVM: nVMX: Consult only the "basic" exit reason when routing nested exit
	KVM: MIPS: Define KVM_ENTRYHI_ASID to cpu_asid_mask(&boot_cpu_data)
	KVM: MIPS: Fix VPN2_MASK definition for variable cpu_vmbits
	KVM: arm64: Make vcpu_cp1x() work on Big Endian hosts
	ath9k: Fix use-after-free Read in ath9k_wmi_ctrl_rx
	ath9k: Fix use-after-free Write in ath9k_htc_rx_msg
	ath9x: Fix stack-out-of-bounds Write in ath9k_hif_usb_rx_cb
	ath9k: Fix general protection fault in ath9k_hif_usb_rx_cb
	Smack: slab-out-of-bounds in vsscanf
	mm/slub: fix a memory leak in sysfs_slab_add()
	fat: don't allow to mount if the FAT length == 0
	perf: Add cond_resched() to task_function_call()
	agp/intel: Reinforce the barrier after GTT updates
	mmc: sdhci-msm: Clear tuning done flag while hs400 tuning
	mmc: sdio: Fix potential NULL pointer error in mmc_sdio_init_card()
	can: kvaser_usb: kvaser_usb_leaf: Fix some info-leaks to USB devices
	xen/pvcalls-back: test for errors when calling backend_connect()
	ACPI: GED: use correct trigger type field in _Exx / _Lxx handling
	drm: bridge: adv7511: Extend list of audio sample rates
	crypto: ccp -- don't "select" CONFIG_DMADEVICES
	media: si2157: Better check for running tuner in init
	objtool: Ignore empty alternatives
	spi: pxa2xx: Apply CS clk quirk to BXT
	net: ena: fix error returning in ena_com_get_hash_function()
	spi: dw: Zero DMA Tx and Rx configurations on stack
	ixgbe: Fix XDP redirect on archs with PAGE_SIZE above 4K
	MIPS: Loongson: Build ATI Radeon GPU driver as module
	Bluetooth: Add SCO fallback for invalid LMP parameters error
	kgdb: Prevent infinite recursive entries to the debugger
	spi: dw: Enable interrupts in accordance with DMA xfer mode
	clocksource: dw_apb_timer: Make CPU-affiliation being optional
	clocksource: dw_apb_timer_of: Fix missing clockevent timers
	btrfs: do not ignore error from btrfs_next_leaf() when inserting checksums
	ARM: 8978/1: mm: make act_mm() respect THREAD_SIZE
	spi: dw: Fix Rx-only DMA transfers
	x86/kvm/hyper-v: Explicitly align hcall param for kvm_hyperv_exit
	net: vmxnet3: fix possible buffer overflow caused by bad DMA value in vmxnet3_get_rss()
	staging: android: ion: use vmap instead of vm_map_ram
	brcmfmac: fix wrong location to get firmware feature
	tools api fs: Make xxx__mountpoint() more scalable
	e1000: Distribute switch variables for initialization
	dt-bindings: display: mediatek: control dpi pins mode to avoid leakage
	audit: fix a net reference leak in audit_send_reply()
	media: dvb: return -EREMOTEIO on i2c transfer failure.
	media: platform: fcp: Set appropriate DMA parameters
	MIPS: Make sparse_init() using top-down allocation
	audit: fix a net reference leak in audit_list_rules_send()
	netfilter: nft_nat: return EOPNOTSUPP if type or flags are not supported
	net: bcmgenet: set Rx mode before starting netif
	lib/mpi: Fix 64-bit MIPS build with Clang
	exit: Move preemption fixup up, move blocking operations down
	net: lpc-enet: fix error return code in lpc_mii_init()
	media: cec: silence shift wrapping warning in __cec_s_log_addrs()
	net: allwinner: Fix use correct return type for ndo_start_xmit()
	powerpc/spufs: fix copy_to_user while atomic
	Crypto/chcr: fix for ccm(aes) failed test
	MIPS: Truncate link address into 32bit for 32bit kernel
	mips: cm: Fix an invalid error code of INTVN_*_ERR
	kgdb: Fix spurious true from in_dbg_master()
	nvme: refine the Qemu Identify CNS quirk
	wcn36xx: Fix error handling path in 'wcn36xx_probe()'
	net: qed*: Reduce RX and TX default ring count when running inside kdump kernel
	md: don't flush workqueue unconditionally in md_open
	rtlwifi: Fix a double free in _rtl_usb_tx_urb_setup()
	mwifiex: Fix memory corruption in dump_station
	x86/boot: Correct relocation destination on old linkers
	mips: MAAR: Use more precise address mask
	mips: Add udelay lpj numbers adjustment
	x86/mm: Stop printing BRK addresses
	m68k: mac: Don't call via_flush_cache() on Mac IIfx
	macvlan: Skip loopback packets in RX handler
	PCI: Don't disable decoding when mmio_always_on is set
	MIPS: Fix IRQ tracing when call handle_fpe() and handle_msa_fpe()
	mmc: sdhci-msm: Set SDHCI_QUIRK_MULTIBLOCK_READ_ACMD12 quirk
	staging: greybus: sdio: Respect the cmd->busy_timeout from the mmc core
	mmc: via-sdmmc: Respect the cmd->busy_timeout from the mmc core
	ixgbe: fix signed-integer-overflow warning
	mmc: sdhci-esdhc-imx: fix the mask for tuning start point
	spi: dw: Return any value retrieved from the dma_transfer callback
	cpuidle: Fix three reference count leaks
	platform/x86: hp-wmi: Convert simple_strtoul() to kstrtou32()
	string.h: fix incompatibility between FORTIFY_SOURCE and KASAN
	btrfs: send: emit file capabilities after chown
	mm: thp: make the THP mapcount atomic against __split_huge_pmd_locked()
	ima: Fix ima digest hash table key calculation
	ima: Directly assign the ima_default_policy pointer to ima_rules
	evm: Fix possible memory leak in evm_calc_hmac_or_hash()
	ext4: fix EXT_MAX_EXTENT/INDEX to check for zeroed eh_max
	ext4: fix error pointer dereference
	ext4: fix race between ext4_sync_parent() and rename()
	PCI: Disable MSI for Freescale Layerscape PCIe RC mode
	PCI: Avoid FLR for AMD Matisse HD Audio & USB 3.0
	PCI: Avoid FLR for AMD Starship USB 3.0
	PCI: Add ACS quirk for iProc PAXB
	PCI: Add ACS quirk for Ampere root ports
	PCI: Make ACS quirk implementations more uniform
	vga_switcheroo: Deduplicate power state tracking
	vga_switcheroo: Use device link for HDA controller
	PCI: Generalize multi-function power dependency device links
	PCI: Add ACS quirk for Intel Root Complex Integrated Endpoints
	PCI: Unify ACS quirk desired vs provided checking
	btrfs: fix error handling when submitting direct I/O bio
	btrfs: fix wrong file range cleanup after an error filling dealloc range
	blk-mq: move _blk_mq_update_nr_hw_queues synchronize_rcu call
	PCI: Program MPS for RCiEP devices
	e1000e: Disable TSO for buffer overrun workaround
	e1000e: Relax condition to trigger reset for ME workaround
	carl9170: remove P2P_GO support
	media: go7007: fix a miss of snd_card_free
	b43legacy: Fix case where channel status is corrupted
	b43: Fix connection problem with WPA3
	b43_legacy: Fix connection problem with WPA3
	media: ov5640: fix use of destroyed mutex
	igb: Report speed and duplex as unknown when device is runtime suspended
	power: vexpress: add suppress_bind_attrs to true
	pinctrl: samsung: Save/restore eint_mask over suspend for EINT_TYPE GPIOs
	sparc32: fix register window handling in genregs32_[gs]et()
	sparc64: fix misuses of access_process_vm() in genregs32_[sg]et()
	dm crypt: avoid truncating the logical block size
	kernel/cpu_pm: Fix uninitted local in cpu_pm
	ARM: tegra: Correct PL310 Auxiliary Control Register initialization
	drivers/macintosh: Fix memleak in windfarm_pm112 driver
	powerpc/64s: Don't let DT CPU features set FSCR_DSCR
	powerpc/64s: Save FSCR to init_task.thread.fscr after feature init
	kbuild: force to build vmlinux if CONFIG_MODVERSION=y
	sunrpc: svcauth_gss_register_pseudoflavor must reject duplicate registrations.
	sunrpc: clean up properly in gss_mech_unregister()
	mtd: rawnand: brcmnand: fix hamming oob layout
	mtd: rawnand: pasemi: Fix the probe error path
	w1: omap-hdq: cleanup to add missing newline for some dev_dbg
	perf probe: Do not show the skipped events
	perf probe: Fix to check blacklist address correctly
	perf symbols: Fix debuginfo search for Ubuntu
	Linux 4.14.185

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ifd3a6f3d9643a42802ed8f061a548a5c5ffcb109
2020-06-22 11:22:58 +02:00
Waiman Long
e27d0385df mm: add kvfree_sensitive() for freeing sensitive data objects
[ Upstream commit d4eaa2837851db2bfed572898bfc17f9a9f9151e ]

For kvmalloc'ed data object that contains sensitive information like
cryptographic keys, we need to make sure that the buffer is always cleared
before freeing it.  Using memset() alone for buffer clearing may not
provide certainty as the compiler may compile it away.  To be sure, the
special memzero_explicit() has to be used.

This patch introduces a new kvfree_sensitive() for freeing those sensitive
data objects allocated by kvmalloc().  The relevant places where
kvfree_sensitive() can be used are modified to use it.

Fixes: 4f0882491a14 ("KEYS: Avoid false positive ENOMEM error on key read")
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Acked-by: David Howells <dhowells@redhat.com>
Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Joe Perches <joe@perches.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Uladzislau Rezki <urezki@gmail.com>
Link: http://lkml.kernel.org/r/20200407200318.11711-1-longman@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-20 10:24:59 +02:00
Jaegeuk Kim
dad710c56f Merge remote-tracking branch 'aosp/upstream-f2fs-stable-linux-4.14.y' into android-4.14
* aosp/upstream-f2fs-stable-linux-4.14.y:
  f2fs: fix race conditions in ->d_compare() and ->d_hash()
  f2fs: fix dcache lookup of !casefolded directories
  f2fs: Add f2fs stats to sysfs
  f2fs: delete duplicate information on sysfs nodes
  f2fs: change to use rwsem for gc_mutex
  f2fs: update f2fs document regarding to fsync_mode
  f2fs: add a way to turn off ipu bio cache
  f2fs: code cleanup for f2fs_statfs_project()
  f2fs: fix miscounted block limit in f2fs_statfs_project()
  f2fs: show the CP_PAUSE reason in checkpoint traces
  f2fs: fix deadlock allocating bio_post_read_ctx from mempool
  f2fs: remove unneeded check for error allocating bio_post_read_ctx
  f2fs: convert inline_dir early before starting rename
  f2fs: fix memleak of kobject
  f2fs: fix to add swap extent correctly
  mm: export add_swap_extent()
  f2fs: run fsck when getting bad inode during GC
  f2fs: support data compression
  f2fs: free sysfs kobject
  f2fs: declare nested quota_sem and remove unnecessary sems
  mm: kvmalloc does not fallback to vmalloc for incompatible gfp flags
  f2fs: don't put new_page twice in f2fs_rename
  f2fs: set I_LINKABLE early to avoid wrong access by vfs
  f2fs: don't keep META_MAPPING pages used for moving verity file blocks
  f2fs: introduce private bioset
  f2fs: cleanup duplicate stats for atomic files
  f2fs: set GFP_NOFS when moving inline dentries
  f2fs: should avoid recursive filesystem ops
  f2fs: keep quota data on write_begin failure
  f2fs: call f2fs_balance_fs outside of locked page
  f2fs: preallocate DIO blocks when forcing buffered_io

Bug: 148667616
Change-Id: Ic885bdb3ef3a8b5d264497b9972b41bcd26b4e85
Signed-off-by: Jaegeuk Kim <jaegeuk@google.com>
2020-02-07 22:45:09 +00:00
Michal Hocko
859f5e4735 mm: kvmalloc does not fallback to vmalloc for incompatible gfp flags
kvmalloc warned about incompatible gfp_mask to catch abusers (mostly
GFP_NOFS) with an intention that this will motivate authors of the code
to fix those.  Linus argues that this just motivates people to do even
more hacks like

	if (gfp == GFP_KERNEL)
		kvmalloc
	else
		kmalloc

I haven't seen this happening much (Linus pointed to bucket_lock special
cases an atomic allocation but my git foo hasn't found much more) but it
is true that we can grow those in future.  Therefore Linus suggested to
simply not fallback to vmalloc for incompatible gfp flags and rather
stick with the kmalloc path.

Link: http://lkml.kernel.org/r/20180601115329.27807-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Tom Herbert <tom@quantonium.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-28 10:37:32 -08:00
Vlastimil Babka
7ca01d96f4 BACKPORT: mm: rename and change semantics of nr_indirectly_reclaimable_bytes
The vmstat counter NR_INDIRECTLY_RECLAIMABLE_BYTES was introduced by
commit eb59254608bc ("mm: introduce NR_INDIRECTLY_RECLAIMABLE_BYTES") with
the goal of accounting objects that can be reclaimed, but cannot be
allocated via a SLAB_RECLAIM_ACCOUNT cache.  This is now possible via
kmalloc() with __GFP_RECLAIMABLE flag, and the dcache external names user
is converted.

The counter is however still useful for accounting direct page allocations
(i.e.  not slab) with a shrinker, such as the ION page pool.  So keep it,
and:

- change granularity to pages to be more like other counters; sub-page
  allocations should be able to use kmalloc
- rename the counter to NR_KERNEL_MISC_RECLAIMABLE
- expose the counter again in vmstat as "nr_kernel_misc_reclaimable"; we can
  again remove the check for not printing "hidden" counters

Link: http://lkml.kernel.org/r/20180731090649.16028-5-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit b29940c1abd7a4c3abeb926df0a5ec84d6902d47)

Conflicts:
        drivers/staging/android/ion/ion_page_pool.c

(1. NR_INDIRECTLY_RECLAIMABLE_BYTES accounting is absent, ignore it since
this patch replaces it anyway.)

Bug: 138148041
Test: verify KReclaimable accounting after ION allocation+deallocation
Change-Id: I6196eaa1e72f16dbde7a2894dc42435e75ae156c
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
2019-12-16 23:32:33 +00:00
Jan Stancek
e973b3929a mm: page_mapped: don't assume compound page is huge or THP
commit 8ab88c7169b7fba98812ead6524b9d05bc76cf00 upstream.

LTP proc01 testcase has been observed to rarely trigger crashes
on arm64:
    page_mapped+0x78/0xb4
    stable_page_flags+0x27c/0x338
    kpageflags_read+0xfc/0x164
    proc_reg_read+0x7c/0xb8
    __vfs_read+0x58/0x178
    vfs_read+0x90/0x14c
    SyS_read+0x60/0xc0

The issue is that page_mapped() assumes that if compound page is not
huge, then it must be THP.  But if this is 'normal' compound page
(COMPOUND_PAGE_DTOR), then following loop can keep running (for
HPAGE_PMD_NR iterations) until it tries to read from memory that isn't
mapped and triggers a panic:

        for (i = 0; i < hpage_nr_pages(page); i++) {
                if (atomic_read(&page[i]._mapcount) >= 0)
                        return true;
	}

I could replicate this on x86 (v4.20-rc4-98-g60b548237fed) only
with a custom kernel module [1] which:
 - allocates compound page (PAGEC) of order 1
 - allocates 2 normal pages (COPY), which are initialized to 0xff (to
   satisfy _mapcount >= 0)
 - 2 PAGEC page structs are copied to address of first COPY page
 - second page of COPY is marked as not present
 - call to page_mapped(COPY) now triggers fault on access to 2nd COPY
   page at offset 0x30 (_mapcount)

[1] https://github.com/jstancek/reproducers/blob/master/kernel/page_mapped_crash/repro.c

Fix the loop to iterate for "1 << compound_order" pages.

Kirrill said "IIRC, sound subsystem can producuce custom mapped compound
pages".

Link: http://lkml.kernel.org/r/c440d69879e34209feba21e12d236d06bc0a25db.1543577156.git.jstancek@redhat.com
Fixes: e1534ae95004 ("mm: differentiate page_mapped() from page_mapcount() for compound pages")
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Debugged-by: Laszlo Ersek <lersek@redhat.com>
Suggested-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-16 22:07:11 +01:00
Roman Gushchin
5de69d648a mm: treat indirectly reclaimable memory as free in overcommit logic
commit d79f7aa496fc94d763f67b833a1f36f4c171176f upstream.

Indirectly reclaimable memory can consume a significant part of total
memory and it's actually reclaimable (it will be released under actual
memory pressure).

So, the overcommit logic should treat it as free.

Otherwise, it's possible to cause random system-wide memory allocation
failures by consuming a significant amount of memory by indirectly
reclaimable memory, e.g.  dentry external names.

If overcommit policy GUESS is used, it might be used for denial of
service attack under some conditions.

The following program illustrates the approach.  It causes the kernel to
allocate an unreclaimable kmalloc-256 chunk for each stat() call, so
that at some point the overcommit logic may start blocking large
allocation system-wide.

  int main()
  {
  	char buf[256];
  	unsigned long i;
  	struct stat statbuf;

  	buf[0] = '/';
  	for (i = 1; i < sizeof(buf); i++)
  		buf[i] = '_';

  	for (i = 0; 1; i++) {
  		sprintf(&buf[248], "%8lu", i);
  		stat(buf, &statbuf);
  	}

  	return 0;
  }

This patch in combination with related indirectly reclaimable memory
patches closes this issue.

Link: http://lkml.kernel.org/r/20180313130041.8078-1-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-18 09:16:25 +02:00
Michal Hocko
c41f012ade mm: rename global_page_state to global_zone_page_state
global_page_state is error prone as a recent bug report pointed out [1].
It only returns proper values for zone based counters as the enum it
gets suggests.  We already have global_node_page_state so let's rename
global_page_state to global_zone_page_state to be more explicit here.
All existing users seems to be correct:

$ git grep "global_page_state(NR_" | sed 's@.*(\(NR_[A-Z_]*\)).*@\1@' | sort | uniq -c
      2 NR_BOUNCE
      2 NR_FREE_CMA_PAGES
     11 NR_FREE_PAGES
      1 NR_KERNEL_STACK_KB
      1 NR_MLOCK
      2 NR_PAGETABLE

This patch shouldn't introduce any functional change.

[1] http://lkml.kernel.org/r/201707260628.v6Q6SmaS030814@www262.sakura.ne.jp

Link: http://lkml.kernel.org/r/20170801134256.5400-2-hannes@cmpxchg.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:29 -07:00
Johannes Weiner
d507e2ebd2 mm: fix global NR_SLAB_.*CLAIMABLE counter reads
As Tetsuo points out:
 "Commit 385386cff4c6 ("mm: vmstat: move slab statistics from zone to
  node counters") broke "Slab:" field of /proc/meminfo . It shows nearly
  0kB"

In addition to /proc/meminfo, this problem also affects the slab
counters OOM/allocation failure info dumps, can cause early -ENOMEM from
overcommit protection, and miscalculate image size requirements during
suspend-to-disk.

This is because the patch in question switched the slab counters from
the zone level to the node level, but forgot to update the global
accessor functions to read the aggregate node data instead of the
aggregate zone data.

Use global_node_page_state() to access the global slab counters.

Fixes: 385386cff4c6 ("mm: vmstat: move slab statistics from zone to node counters")
Link: http://lkml.kernel.org/r/20170801134256.5400-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Stefan Agner <stefan@agner.ch>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-10 15:54:06 -07:00
Linus Torvalds
78dcf73421 Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull ->s_options removal from Al Viro:
 "Preparations for fsmount/fsopen stuff (coming next cycle). Everything
  gets moved to explicit ->show_options(), killing ->s_options off +
  some cosmetic bits around fs/namespace.c and friends. Basically, the
  stuff needed to work with fsmount series with minimum of conflicts
  with other work.

  It's not strictly required for this merge window, but it would reduce
  the PITA during the coming cycle, so it would be nice to have those
  bits and pieces out of the way"

* 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  isofs: Fix isofs_show_options()
  VFS: Kill off s_options and helpers
  orangefs: Implement show_options
  9p: Implement show_options
  isofs: Implement show_options
  afs: Implement show_options
  affs: Implement show_options
  befs: Implement show_options
  spufs: Implement show_options
  bpf: Implement show_options
  ramfs: Implement show_options
  pstore: Implement show_options
  omfs: Implement show_options
  hugetlbfs: Implement show_options
  VFS: Don't use save/replace_mount_options if not using generic_show_options
  VFS: Provide empty name qstr
  VFS: Make get_filesystem() return the affected filesystem
  VFS: Clean up whitespace in fs/namespace.c and fs/super.c
  Provide a function to create a NUL-terminated string from unterminated data
2017-07-15 12:00:42 -07:00
Michal Hocko
cc965a29db mm: kvmalloc support __GFP_RETRY_MAYFAIL for all sizes
Now that __GFP_RETRY_MAYFAIL has a reasonable semantic regardless of the
request size we can drop the hackish implementation for !costly orders.
__GFP_RETRY_MAYFAIL retries as long as the reclaim makes a forward
progress and backs of when we are out of memory for the requested size.
Therefore we do not need to enforce__GFP_NORETRY for !costly orders just
to silent the oom killer anymore.

Link: http://lkml.kernel.org/r/20170623085345.11304-5-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alex Belits <alex.belits@cavium.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: David Daney <david.daney@cavium.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: NeilBrown <neilb@suse.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:03 -07:00
Michal Hocko
dcda9b0471 mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic
__GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to
the page allocator.  This has been true but only for allocations
requests larger than PAGE_ALLOC_COSTLY_ORDER.  It has been always
ignored for smaller sizes.  This is a bit unfortunate because there is
no way to express the same semantic for those requests and they are
considered too important to fail so they might end up looping in the
page allocator for ever, similarly to GFP_NOFAIL requests.

Now that the whole tree has been cleaned up and accidental or misled
usage of __GFP_REPEAT flag has been removed for !costly requests we can
give the original flag a better name and more importantly a more useful
semantic.  Let's rename it to __GFP_RETRY_MAYFAIL which tells the user
that the allocator would try really hard but there is no promise of a
success.  This will work independent of the order and overrides the
default allocator behavior.  Page allocator users have several levels of
guarantee vs.  cost options (take GFP_KERNEL as an example)

 - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
   attempt to free memory at all. The most light weight mode which even
   doesn't kick the background reclaim. Should be used carefully because
   it might deplete the memory and the next user might hit the more
   aggressive reclaim

 - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic
   allocation without any attempt to free memory from the current
   context but can wake kswapd to reclaim memory if the zone is below
   the low watermark. Can be used from either atomic contexts or when
   the request is a performance optimization and there is another
   fallback for a slow path.

 - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
   non sleeping allocation with an expensive fallback so it can access
   some portion of memory reserves. Usually used from interrupt/bh
   context with an expensive slow path fallback.

 - GFP_KERNEL - both background and direct reclaim are allowed and the
   _default_ page allocator behavior is used. That means that !costly
   allocation requests are basically nofail but there is no guarantee of
   that behavior so failures have to be checked properly by callers
   (e.g. OOM killer victim is allowed to fail currently).

 - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
   and all allocation requests fail early rather than cause disruptive
   reclaim (one round of reclaim in this implementation). The OOM killer
   is not invoked.

 - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
   behavior and all allocation requests try really hard. The request
   will fail if the reclaim cannot make any progress. The OOM killer
   won't be triggered.

 - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
   and all allocation requests will loop endlessly until they succeed.
   This might be really dangerous especially for larger orders.

Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
because they already had their semantic.  No new users are added.
__alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
there is no progress and we have already passed the OOM point.

This means that all the reclaim opportunities have been exhausted except
the most disruptive one (the OOM killer) and a user defined fallback
behavior is more sensible than keep retrying in the page allocator.

[akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
[mhocko@suse.com: semantic fix]
  Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
[mhocko@kernel.org: address other thing spotted by Vlastimil]
  Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alex Belits <alex.belits@cavium.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: David Daney <david.daney@cavium.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: NeilBrown <neilb@suse.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:03 -07:00
David Howells
f351574172 Provide a function to create a NUL-terminated string from unterminated data
Provide a function, kmemdup_nul(), that will create a NUL-terminated string
from an unterminated character array where the length is known in advance.

This is better than kstrndup() in situations where we already know the
string length as the strnlen() in kstrndup() is superfluous.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-06 03:27:09 -04:00
Michal Hocko
4f4f2ba9c5 mm: clarify why we want kmalloc before falling backto vmallock
While converting drm_[cm]alloc* helpers to kvmalloc* variants Chris
Wilson has wondered why we want to try kmalloc before vmalloc fallback
even for larger allocations requests.  Let's clarify that one larger
physically contiguous block is less likely to fragment memory than many
scattered pages which can prevent more large blocks from being created.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20170517080932.21423-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-02 15:07:37 -07:00
Michal Hocko
8594a21cf7 mm, vmalloc: fix vmalloc users tracking properly
Commit 1f5307b1e094 ("mm, vmalloc: properly track vmalloc users") has
pulled asm/pgtable.h include dependency to linux/vmalloc.h and that
turned out to be a bad idea for some architectures.  E.g.  m68k fails
with

   In file included from arch/m68k/include/asm/pgtable_mm.h:145:0,
                    from arch/m68k/include/asm/pgtable.h:4,
                    from include/linux/vmalloc.h:9,
                    from arch/m68k/kernel/module.c:9:
   arch/m68k/include/asm/mcf_pgtable.h: In function 'nocache_page':
>> arch/m68k/include/asm/mcf_pgtable.h:339:43: error: 'init_mm' undeclared (first use in this function)
    #define pgd_offset_k(address) pgd_offset(&init_mm, address)

as spotted by kernel build bot. nios2 fails for other reason

  In file included from include/asm-generic/io.h:767:0,
                   from arch/nios2/include/asm/io.h:61,
                   from include/linux/io.h:25,
                   from arch/nios2/include/asm/pgtable.h:18,
                   from include/linux/mm.h:70,
                   from include/linux/pid_namespace.h:6,
                   from include/linux/ptrace.h:9,
                   from arch/nios2/include/uapi/asm/elf.h:23,
                   from arch/nios2/include/asm/elf.h:22,
                   from include/linux/elf.h:4,
                   from include/linux/module.h:15,
                   from init/main.c:16:
  include/linux/vmalloc.h: In function '__vmalloc_node_flags':
  include/linux/vmalloc.h:99:40: error: 'PAGE_KERNEL' undeclared (first use in this function); did you mean 'GFP_KERNEL'?

which is due to the newly added #include <asm/pgtable.h>, which on nios2
includes <linux/io.h> and thus <asm/io.h> and <asm-generic/io.h> which
again includes <linux/vmalloc.h>.

Tweaking that around just turns out a bigger headache than necessary.
This patch reverts 1f5307b1e094 and reimplements the original fix in a
different way.  __vmalloc_node_flags can stay static inline which will
cover vmalloc* functions.  We only have one external user
(kvmalloc_node) and we can export __vmalloc_node_flags_caller and
provide the caller directly.  This is much simpler and it doesn't really
need any games with header files.

[akpm@linux-foundation.org: coding-style fixes]
[mhocko@kernel.org: revert old comment]
  Link: http://lkml.kernel.org/r/20170509211054.GB16325@dhcp22.suse.cz
Fixes: 1f5307b1e094 ("mm, vmalloc: properly track vmalloc users")
Link: http://lkml.kernel.org/r/20170509153702.GR6481@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Tobias Klauser <tklauser@distanz.ch>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-12 15:57:15 -07:00
Michal Hocko
19809c2da2 mm, vmalloc: use __GFP_HIGHMEM implicitly
__vmalloc* allows users to provide gfp flags for the underlying
allocation.  This API is quite popular

  $ git grep "=[[:space:]]__vmalloc\|return[[:space:]]*__vmalloc" | wc -l
  77

The only problem is that many people are not aware that they really want
to give __GFP_HIGHMEM along with other flags because there is really no
reason to consume precious lowmemory on CONFIG_HIGHMEM systems for pages
which are mapped to the kernel vmalloc space.  About half of users don't
use this flag, though.  This signals that we make the API unnecessarily
too complex.

This patch simply uses __GFP_HIGHMEM implicitly when allocating pages to
be mapped to the vmalloc space.  Current users which add __GFP_HIGHMEM
are simplified and drop the flag.

Link: http://lkml.kernel.org/r/20170307141020.29107-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Cristopher Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:13 -07:00
Michal Hocko
6c5ab6511f mm: support __GFP_REPEAT in kvmalloc_node for >32kB
vhost code uses __GFP_REPEAT when allocating vhost_virtqueue resp.
vhost_vsock because it would really like to prefer kmalloc to the
vmalloc fallback - see 23cc5a991c7a ("vhost-net: extend device
allocation to vmalloc") for more context.  Michael Tsirkin has also
noted:

 "__GFP_REPEAT overhead is during allocation time. Using vmalloc means
  all accesses are slowed down. Allocation is not on data path, accesses
  are."

The similar applies to other vhost_kvzalloc users.

Let's teach kvmalloc_node to handle __GFP_REPEAT properly.  There are
two things to be careful about.  First we should prevent from the OOM
killer and so have to involve __GFP_NORETRY by default and secondly
override __GFP_REPEAT for !costly order requests as the __GFP_REPEAT is
ignored for !costly orders.

Supporting __GFP_REPEAT like semantic for !costly request is possible it
would require changes in the page allocator.  This is out of scope of
this patch.

This patch shouldn't introduce any functional change.

Link: http://lkml.kernel.org/r/20170306103032.2540-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:12 -07:00
Michal Hocko
a7c3e901a4 mm: introduce kv[mz]alloc helpers
Patch series "kvmalloc", v5.

There are many open coded kmalloc with vmalloc fallback instances in the
tree.  Most of them are not careful enough or simply do not care about
the underlying semantic of the kmalloc/page allocator which means that
a) some vmalloc fallbacks are basically unreachable because the kmalloc
part will keep retrying until it succeeds b) the page allocator can
invoke a really disruptive steps like the OOM killer to move forward
which doesn't sound appropriate when we consider that the vmalloc
fallback is available.

As it can be seen implementing kvmalloc requires quite an intimate
knowledge if the page allocator and the memory reclaim internals which
strongly suggests that a helper should be implemented in the memory
subsystem proper.

Most callers, I could find, have been converted to use the helper
instead.  This is patch 6.  There are some more relying on __GFP_REPEAT
in the networking stack which I have converted as well and Eric Dumazet
was not opposed [2] to convert them as well.

[1] http://lkml.kernel.org/r/20170130094940.13546-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/1485273626.16328.301.camel@edumazet-glaptop3.roam.corp.google.com

This patch (of 9):

Using kmalloc with the vmalloc fallback for larger allocations is a
common pattern in the kernel code.  Yet we do not have any common helper
for that and so users have invented their own helpers.  Some of them are
really creative when doing so.  Let's just add kv[mz]alloc and make sure
it is implemented properly.  This implementation makes sure to not make
a large memory pressure for > PAGE_SZE requests (__GFP_NORETRY) and also
to not warn about allocation failures.  This also rules out the OOM
killer as the vmalloc is a more approapriate fallback than a disruptive
user visible action.

This patch also changes some existing users and removes helpers which
are specific for them.  In some cases this is not possible (e.g.
ext4_kvmalloc, libcfs_kvzalloc) because those seems to be broken and
require GFP_NO{FS,IO} context which is not vmalloc compatible in general
(note that the page table allocation is GFP_KERNEL).  Those need to be
fixed separately.

While we are at it, document that __vmalloc{_node} about unsupported gfp
mask because there seems to be a lot of confusion out there.
kvmalloc_node will warn about GFP_KERNEL incompatible (which are not
superset) flags to catch new abusers.  Existing ones would have to die
slowly.

[sfr@canb.auug.org.au: f2fs fixup]
  Link: http://lkml.kernel.org/r/20170320163735.332e64b7@canb.auug.org.au
Link: http://lkml.kernel.org/r/20170306103032.2540-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>	[ext4 part]
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08 17:15:12 -07:00
Ingo Molnar
68db0cf106 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/task_stack.h>
We are going to split <linux/sched/task_stack.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/task_stack.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:36 +01:00
Ingo Molnar
6e84f31522 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/mm.h>
We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/mm.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

The APIs that are going to be moved first are:

   mm_alloc()
   __mmdrop()
   mmdrop()
   mmdrop_async_fn()
   mmdrop_async()
   mmget_not_zero()
   mmput()
   mmput_async()
   get_task_mm()
   mm_access()
   mm_release()

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:28 +01:00
Mike Rapoport
897ab3e0c4 userfaultfd: non-cooperative: add event for memory unmaps
When a non-cooperative userfaultfd monitor copies pages in the
background, it may encounter regions that were already unmapped.
Addition of UFFD_EVENT_UNMAP allows the uffd monitor to track precisely
changes in the virtual memory layout.

Since there might be different uffd contexts for the affected VMAs, we
first should create a temporary representation for the unmap event for
each uffd context and then notify them one by one to the appropriate
userfault file descriptors.

The event notification occurs after the mmap_sem has been released.

[arnd@arndb.de: fix nommu build]
  Link: http://lkml.kernel.org/r/20170203165141.3665284-1-arnd@arndb.de
[mhocko@suse.com: fix nommu build]
  Link: http://lkml.kernel.org/r/20170202091503.GA22823@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/1485542673-24387-3-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24 17:46:55 -08:00
Linus Torvalds
7c0f6ba682 Replace <asm/uaccess.h> with <linux/uaccess.h> globally
This was entirely automated, using the script by Al:

  PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
  sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
        $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)

to do the replacement at the end of the merge window.

Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-24 11:46:01 -08:00
Linus Torvalds
86c5bf7101 Merge branch 'mm-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull vmap stack fixes from Ingo Molnar:
 "This is fallout from CONFIG_HAVE_ARCH_VMAP_STACK=y on x86: stack
  accesses that used to be just somewhat questionable are now totally
  buggy.

  These changes try to do it without breaking the ABI: the fields are
  left there, they are just reporting zero, or reporting narrower
  information (the maps file change)"

* 'mm-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  mm: Change vm_is_stack_for_task() to vm_is_stack_for_current()
  fs/proc: Stop trying to report thread stacks
  fs/proc: Stop reporting eip and esp in /proc/PID/stat
  mm/numa: Remove duplicated include from mprotect.c
2016-10-22 09:39:10 -07:00
Andy Lutomirski
d17af5056c mm: Change vm_is_stack_for_task() to vm_is_stack_for_current()
Asking for a non-current task's stack can't be done without races
unless the task is frozen in kernel mode.  As far as I know,
vm_is_stack_for_task() never had a safe non-current use case.

The __unused annotation is because some KSTK_ESP implementations
ignore their parameter, which IMO is further justification for this
patch.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Jann Horn <jann@thejh.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linux API <linux-api@vger.kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tycho Andersen <tycho.andersen@canonical.com>
Link: http://lkml.kernel.org/r/4c3f68f426e6c061ca98b4fc7ef85ffbb0a25b0c.1475257877.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-20 09:21:41 +02:00
Lorenzo Stoakes
f307ab6dce mm: replace access_process_vm() write parameter with gup_flags
This removes the 'write' argument from access_process_vm() and replaces
it with 'gup_flags' as use of this function previously silently implied
FOLL_FORCE, whereas after this patch callers explicitly pass this flag.

We make this explicit as use of FOLL_FORCE can result in surprising
behaviour (and hence bugs) within the mm subsystem.

Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-19 08:31:25 -07:00
Lorenzo Stoakes
c164154f66 mm: replace get_user_pages_unlocked() write/force parameters with gup_flags
This removes the 'write' and 'force' use from get_user_pages_unlocked()
and replaces them with 'gup_flags' to make the use of FOLL_FORCE
explicit in callers as use of this flag can result in surprising
behaviour (and hence bugs) within the mm subsystem.

Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-18 14:13:37 -07:00
Mel Gorman
11fb998986 mm: move most file-based accounting to the node
There are now a number of accounting oddities such as mapped file pages
being accounted for on the node while the total number of file pages are
accounted on the zone.  This can be coped with to some extent but it's
confusing so this patch moves the relevant file-based accounted.  Due to
throttling logic in the page allocator for reliable OOM detection, it is
still necessary to track dirty and writeback pages on a per-zone basis.

[mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting]
  Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net
Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Kirill A. Shutemov
dd78fedde4 rmap: support file thp
Naive approach: on mapping/unmapping the page as compound we update
->_mapcount on each 4k page.  That's not efficient, but it's not obvious
how we can optimize this.  We can look into optimization later.

PG_double_map optimization doesn't work for file pages since lifecycle
of file pages is different comparing to anon pages: file page can be
mapped again at any time.

Link: http://lkml.kernel.org/r/1466021202-61880-11-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Minchan Kim
bda807d444 mm: migrate: support non-lru movable page migration
We have allowed migration for only LRU pages until now and it was enough
to make high-order pages.  But recently, embedded system(e.g., webOS,
android) uses lots of non-movable pages(e.g., zram, GPU memory) so we
have seen several reports about troubles of small high-order allocation.
For fixing the problem, there were several efforts (e,g,.  enhance
compaction algorithm, SLUB fallback to 0-order page, reserved memory,
vmalloc and so on) but if there are lots of non-movable pages in system,
their solutions are void in the long run.

So, this patch is to support facility to change non-movable pages with
movable.  For the feature, this patch introduces functions related to
migration to address_space_operations as well as some page flags.

If a driver want to make own pages movable, it should define three
functions which are function pointers of struct
address_space_operations.

1. bool (*isolate_page) (struct page *page, isolate_mode_t mode);

What VM expects on isolate_page function of driver is to return *true*
if driver isolates page successfully.  On returing true, VM marks the
page as PG_isolated so concurrent isolation in several CPUs skip the
page for isolation.  If a driver cannot isolate the page, it should
return *false*.

Once page is successfully isolated, VM uses page.lru fields so driver
shouldn't expect to preserve values in that fields.

2. int (*migratepage) (struct address_space *mapping,
		struct page *newpage, struct page *oldpage, enum migrate_mode);

After isolation, VM calls migratepage of driver with isolated page.  The
function of migratepage is to move content of the old page to new page
and set up fields of struct page newpage.  Keep in mind that you should
indicate to the VM the oldpage is no longer movable via
__ClearPageMovable() under page_lock if you migrated the oldpage
successfully and returns 0.  If driver cannot migrate the page at the
moment, driver can return -EAGAIN.  On -EAGAIN, VM will retry page
migration in a short time because VM interprets -EAGAIN as "temporal
migration failure".  On returning any error except -EAGAIN, VM will give
up the page migration without retrying in this time.

Driver shouldn't touch page.lru field VM using in the functions.

3. void (*putback_page)(struct page *);

If migration fails on isolated page, VM should return the isolated page
to the driver so VM calls driver's putback_page with migration failed
page.  In this function, driver should put the isolated page back to the
own data structure.

4. non-lru movable page flags

There are two page flags for supporting non-lru movable page.

* PG_movable

Driver should use the below function to make page movable under
page_lock.

	void __SetPageMovable(struct page *page, struct address_space *mapping)

It needs argument of address_space for registering migration family
functions which will be called by VM.  Exactly speaking, PG_movable is
not a real flag of struct page.  Rather than, VM reuses page->mapping's
lower bits to represent it.

	#define PAGE_MAPPING_MOVABLE 0x2
	page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;

so driver shouldn't access page->mapping directly.  Instead, driver
should use page_mapping which mask off the low two bits of page->mapping
so it can get right struct address_space.

For testing of non-lru movable page, VM supports __PageMovable function.
However, it doesn't guarantee to identify non-lru movable page because
page->mapping field is unified with other variables in struct page.  As
well, if driver releases the page after isolation by VM, page->mapping
doesn't have stable value although it has PAGE_MAPPING_MOVABLE (Look at
__ClearPageMovable).  But __PageMovable is cheap to catch whether page
is LRU or non-lru movable once the page has been isolated.  Because LRU
pages never can have PAGE_MAPPING_MOVABLE in page->mapping.  It is also
good for just peeking to test non-lru movable pages before more
expensive checking with lock_page in pfn scanning to select victim.

For guaranteeing non-lru movable page, VM provides PageMovable function.
Unlike __PageMovable, PageMovable functions validates page->mapping and
mapping->a_ops->isolate_page under lock_page.  The lock_page prevents
sudden destroying of page->mapping.

Driver using __SetPageMovable should clear the flag via
__ClearMovablePage under page_lock before the releasing the page.

* PG_isolated

To prevent concurrent isolation among several CPUs, VM marks isolated
page as PG_isolated under lock_page.  So if a CPU encounters PG_isolated
non-lru movable page, it can skip it.  Driver doesn't need to manipulate
the flag because VM will set/clear it automatically.  Keep in mind that
if driver sees PG_isolated page, it means the page have been isolated by
VM so it shouldn't touch page.lru field.  PG_isolated is alias with
PG_reclaim flag so driver shouldn't use the flag for own purpose.

[opensource.ganesh@gmail.com: mm/compaction: remove local variable is_lru]
  Link: http://lkml.kernel.org/r/20160618014841.GA7422@leo-test
Link: http://lkml.kernel.org/r/1464736881-24886-3-git-send-email-minchan@kernel.org
Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: John Einar Reitan <john.reitan@foss.arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Michal Hocko
9fbeb5ab59 mm: make vm_mmap killable
All the callers of vm_mmap seem to check for the failure already and
bail out in one way or another on the error which means that we can
change it to use killable version of vm_mmap_pgoff and return -EINTR if
the current task gets killed while waiting for mmap_sem.  This also
means that vm_mmap_pgoff can be killable by default and drop the
additional parameter.

This will help in the OOM conditions when the oom victim might be stuck
waiting for the mmap_sem for write which in turn can block oom_reaper
which relies on the mmap_sem for read to make a forward progress and
reclaim the address space of the victim.

Please note that load_elf_binary is ignoring vm_mmap error for
current->personality & MMAP_PAGE_ZERO case but that shouldn't be a
problem because the address is not used anywhere and we never return to
the userspace if we got killed.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 17:04:14 -07:00
Michal Hocko
dc0ef0df7b mm: make mmap_sem for write waits killable for mm syscalls
This is a follow up work for oom_reaper [1].  As the async OOM killing
depends on oom_sem for read we would really appreciate if a holder for
write didn't stood in the way.  This patchset is changing many of
down_write calls to be killable to help those cases when the writer is
blocked and waiting for readers to release the lock and so help
__oom_reap_task to process the oom victim.

Most of the patches are really trivial because the lock is help from a
shallow syscall paths where we can return EINTR trivially and allow the
current task to die (note that EINTR will never get to the userspace as
the task has fatal signal pending).  Others seem to be easy as well as
the callers are already handling fatal errors and bail and return to
userspace which should be sufficient to handle the failure gracefully.
I am not familiar with all those code paths so a deeper review is really
appreciated.

As this work is touching more areas which are not directly connected I
have tried to keep the CC list as small as possible and people who I
believed would be familiar are CCed only to the specific patches (all
should have received the cover though).

This patchset is based on linux-next and it depends on
down_write_killable for rw_semaphores which got merged into tip
locking/rwsem branch and it is merged into this next tree.  I guess it
would be easiest to route these patches via mmotm because of the
dependency on the tip tree but if respective maintainers prefer other
way I have no objections.

I haven't covered all the mmap_write(mm->mmap_sem) instances here

  $ git grep "down_write(.*\<mmap_sem\>)" next/master | wc -l
  98
  $ git grep "down_write(.*\<mmap_sem\>)" | wc -l
  62

I have tried to cover those which should be relatively easy to review in
this series because this alone should be a nice improvement.  Other
places can be changed on top.

[0] http://lkml.kernel.org/r/1456752417-9626-1-git-send-email-mhocko@kernel.org
[1] http://lkml.kernel.org/r/1452094975-551-1-git-send-email-mhocko@kernel.org
[2] http://lkml.kernel.org/r/1456750705-7141-1-git-send-email-mhocko@kernel.org

This patch (of 18):

This is the first step in making mmap_sem write waiters killable.  It
focuses on the trivial ones which are taking the lock early after
entering the syscall and they are not changing state before.

Therefore it is very easy to change them to use down_write_killable and
immediately return with -EINTR.  This will allow the waiter to pass away
without blocking the mmap_sem which might be required to make a forward
progress.  E.g.  the oom reaper will need the lock for reading to
dismantle the OOM victim address space.

The only tricky function in this patch is vm_mmap_pgoff which has many
call sites via vm_mmap.  To reduce the risk keep vm_mmap with the
original non-killable semantic for now.

vm_munmap callers do not bother checking the return value so open code
it into the munmap syscall path for now for simplicity.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 17:04:14 -07:00
Andrew Morton
1aa8aea535 mm: uninline page_mapped()
It's huge.  Uninlining it saves 206 bytes per callsite.  Shaves 4924
bytes from the x86_64 allmodconfig vmlinux.

[akpm@linux-foundation.org: coding-style fixes]
Cc: Steve Capper <steve.capper@arm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Linus Torvalds
643ad15d47 Merge branch 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 protection key support from Ingo Molnar:
 "This tree adds support for a new memory protection hardware feature
  that is available in upcoming Intel CPUs: 'protection keys' (pkeys).

  There's a background article at LWN.net:

      https://lwn.net/Articles/643797/

  The gist is that protection keys allow the encoding of
  user-controllable permission masks in the pte.  So instead of having a
  fixed protection mask in the pte (which needs a system call to change
  and works on a per page basis), the user can map a (handful of)
  protection mask variants and can change the masks runtime relatively
  cheaply, without having to change every single page in the affected
  virtual memory range.

  This allows the dynamic switching of the protection bits of large
  amounts of virtual memory, via user-space instructions.  It also
  allows more precise control of MMU permission bits: for example the
  executable bit is separate from the read bit (see more about that
  below).

  This tree adds the MM infrastructure and low level x86 glue needed for
  that, plus it adds a high level API to make use of protection keys -
  if a user-space application calls:

        mmap(..., PROT_EXEC);

  or

        mprotect(ptr, sz, PROT_EXEC);

  (note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice
  this special case, and will set a special protection key on this
  memory range.  It also sets the appropriate bits in the Protection
  Keys User Rights (PKRU) register so that the memory becomes unreadable
  and unwritable.

  So using protection keys the kernel is able to implement 'true'
  PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies
  PROT_READ as well.  Unreadable executable mappings have security
  advantages: they cannot be read via information leaks to figure out
  ASLR details, nor can they be scanned for ROP gadgets - and they
  cannot be used by exploits for data purposes either.

  We know about no user-space code that relies on pure PROT_EXEC
  mappings today, but binary loaders could start making use of this new
  feature to map binaries and libraries in a more secure fashion.

  There is other pending pkeys work that offers more high level system
  call APIs to manage protection keys - but those are not part of this
  pull request.

  Right now there's a Kconfig that controls this feature
  (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled
  (like most x86 CPU feature enablement code that has no runtime
  overhead), but it's not user-configurable at the moment.  If there's
  any serious problem with this then we can make it configurable and/or
  flip the default"

* 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
  x86/mm/pkeys: Fix mismerge of protection keys CPUID bits
  mm/pkeys: Fix siginfo ABI breakage caused by new u64 field
  x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA
  mm/core, x86/mm/pkeys: Add execute-only protection keys support
  x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags
  x86/mm/pkeys: Allow kernel to modify user pkey rights register
  x86/fpu: Allow setting of XSAVE state
  x86/mm: Factor out LDT init from context init
  mm/core, x86/mm/pkeys: Add arch_validate_pkey()
  mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()
  x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU
  x86/mm/pkeys: Add Kconfig prompt to existing config option
  x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
  x86/mm/pkeys: Dump PKRU with other kernel registers
  mm/core, x86/mm/pkeys: Differentiate instruction fetches
  x86/mm/pkeys: Optimize fault handling in access_error()
  mm/core: Do not enforce PKEY permissions on remote mm access
  um, pkeys: Add UML arch_*_access_permitted() methods
  mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys
  x86/mm/gup: Simplify get_user_pages() PTE bit handling
  ...
2016-03-20 19:08:56 -07:00
Andrey Ryabinin
39a1aa8e19 mm: deduplicate memory overcommitment code
Currently we have two copies of the same code which implements memory
overcommitment logic.  Let's move it into mm/util.c and hence avoid
duplication.  No functional changes here.

Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Dave Hansen
cde70140fe mm/gup: Overload get_user_pages() functions
The concept here was a suggestion from Ingo.  The implementation
horrors are all mine.

This allows get_user_pages(), get_user_pages_unlocked(), and
get_user_pages_locked() to be called with or without the
leading tsk/mm arguments.  We will give a compile-time warning
about the old style being __deprecated and we will also
WARN_ON() if the non-remote version is used for a remote-style
access.

Doing this, folks will get nice warnings and will not break the
build.  This should be nice for -next and will hopefully let
developers fix up their own code instead of maintainers needing
to do it at merge time.

The way we do this is hideous.  It uses the __VA_ARGS__ macro
functionality to call different functions based on the number
of arguments passed to the macro.

There's an additional hack to ensure that our EXPORT_SYMBOL()
of the deprecated symbols doesn't trigger a warning.

We should be able to remove this mess as soon as -rc1 hits in
the release after this is merged.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexander Kuleshov <kuleshovmail@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
Cc: Geliang Tang <geliangtang@163.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Leon Romanovsky <leon@leon.nu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xie XiuQi <xiexiuqi@huawei.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160212210155.73222EE1@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-16 10:11:12 +01:00
Johannes Weiner
65376df582 proc: revert /proc/<pid>/maps [stack:TID] annotation
Commit b76437579d13 ("procfs: mark thread stack correctly in
proc/<pid>/maps") added [stack:TID] annotation to /proc/<pid>/maps.

Finding the task of a stack VMA requires walking the entire thread list,
turning this into quadratic behavior: a thousand threads means a
thousand stacks, so the rendering of /proc/<pid>/maps needs to look at a
million combinations.

The cost is not in proportion to the usefulness as described in the
patch.

Drop the [stack:TID] annotation to make /proc/<pid>/maps (and
/proc/<pid>/numa_maps) usable again for higher thread counts.

The [stack] annotation inside /proc/<pid>/task/<tid>/maps is retained, as
identifying the stack VMA there is an O(1) operation.

Siddesh said:
 "The end users needed a way to identify thread stacks programmatically and
  there wasn't a way to do that.  I'm afraid I no longer remember (or have
  access to the resources that would aid my memory since I changed
  employers) the details of their requirement.  However, I did do this on my
  own time because I thought it was an interesting project for me and nobody
  really gave any feedback then as to its utility, so as far as I am
  concerned you could roll back the main thread maps information since the
  information is available in the thread-specific files"

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Siddhesh Poyarekar <siddhesh.poyarekar@gmail.com>
Cc: Shaohua Li <shli@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-03 08:28:43 -08:00
Mateusz Guzik
a3b609ef9f proc read mm's {arg,env}_{start,end} with mmap semaphore taken.
Only functions doing more than one read are modified.  Consumeres
happened to deal with possibly changing data, but it does not seem like
a good thing to rely on.

Signed-off-by: Mateusz Guzik <mguzik@redhat.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Jarod Wilson <jarod@redhat.com>
Cc: Jan Stancek <jstancek@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Anshuman Khandual <anshuman.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20 17:09:18 -08:00
Kirill A. Shutemov
b20ce5e03b mm: prepare page_referenced() and page_idle to new THP refcounting
Both page_referenced() and page_idle_clear_pte_refs_one() assume that
THP can only be mapped with PMD, so there's no reason to look on PTEs
for PageTransHuge() pages.  That's no true anymore: THP can be mapped
with PTEs too.

The patch removes PageTransHuge() test from the functions and opencode
page table check.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Kirill A. Shutemov
1c290f6421 mm: sanitize page->mapping for tail pages
We don't define meaning of page->mapping for tail pages.  Currently it's
always NULL, which can be inconsistent with head page and potentially
lead to problems.

Let's poison the pointer to catch all illigal uses.

page_rmapping(), page_mapping() and page_anon_vma() are changed to look
on head page.

The only illegal use I've caught so far is __GPF_COMP pages from sound
subsystem, mapped with PTEs.  do_shared_fault() is changed to use
page_rmapping() instead of direct access to fault_page->mapping.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Al Viro
e9d408e107 new helper: memdup_user_nul()
Similar to memdup_user(), except that allocated buffer is one byte
longer and '\0' is stored after the copied data.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-04 10:20:19 -05:00
Alexander Kuleshov
ea53cde089 mm/util: use offset_in_page macro
linux/mm.h provides offset_in_page() macro.  Let's use already predefined
macro instead of (addr & ~PAGE_MASK).

Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
Kirill A. Shutemov
e39155ea11 mm: uninline and cleanup page-mapping related helpers
Most-used page->mapping helper -- page_mapping() -- has already uninlined.
 Let's uninline also page_rmapping() and page_anon_vma().  It saves us
depending on configuration around 400 bytes in text:

   text	   data	    bss	    dec	    hex	filename
 660318	  99254	 410000	1169572	 11d8a4	mm/built-in.o-before
 659854	  99254	 410000	1169108	 11d6d4	mm/built-in.o

I also tried to make code a bit more clean.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15 16:35:19 -07:00
Andrzej Hajda
a4bb1e43e2 mm/util: add kstrdup_const
kstrdup() is often used to duplicate strings where neither source neither
destination will be ever modified.  In such case we can just reuse the
source instead of duplicating it.  The problem is that we must be sure
that the source is non-modifiable and its life-time is long enough.

I suspect the good candidates for such strings are strings located in
kernel .rodata section, they cannot be modifed because the section is
read-only and their life-time is equal to kernel life-time.

This small patchset proposes alternative version of kstrdup -
kstrdup_const, which returns source string if it is located in .rodata
otherwise it fallbacks to kstrdup.  To verify if the source is in
.rodata function checks if the address is between sentinels
__start_rodata, __end_rodata.  I guess it should work with all
architectures.

The main patch is accompanied by four patches constifying kstrdup for
cases where situtation described above happens frequently.

I have tested the patchset on mobile platform (exynos4210-trats) and it
saves 3272 string allocations.  Since minimal allocation is 32 or 64
bytes depending on Kconfig options the patchset saves respectively about
100KB or 200KB of memory.

Stats from tested platform show that the main offender is sysfs:

By caller:
  2260 __kernfs_new_node
    631 clk_register+0xc8/0x1b8
    318 clk_register+0x34/0x1b8
      51 kmem_cache_create
      12 alloc_vfsmnt

By string (with count >= 5):
    883 power
    876 subsystem
    135 parameters
    132 device
     61 iommu_group
    ...

This patch (of 5):

Add an alternative version of kstrdup which returns pointer to constant
char array.  The function checks if input string is in persistent and
read-only memory section, if yes it returns the input string, otherwise it
fallbacks to kstrdup.

kstrdup_const is accompanied by kfree_const performing conditional memory
deallocation of the string.

Signed-off-by: Andrzej Hajda <a.hajda@samsung.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Mike Turquette <mturquette@linaro.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Greg KH <greg@kroah.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-13 21:21:35 -08:00
Andrea Arcangeli
a7b780750e mm: gup: use get_user_pages_unlocked within get_user_pages_fast
This allows the get_user_pages_fast slow path to release the mmap_sem
before blocking.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 17:06:05 -08:00
Oleg Nesterov
58cb65487e proc/maps: make vm_is_stack() logic namespace-friendly
- Rename vm_is_stack() to task_of_stack() and change it to return
  "struct task_struct *" rather than the global (and thus wrong in
  general) pid_t.

- Add the new pid_of_stack() helper which calls task_of_stack() and
  uses the right namespace to report the correct pid_t.

  Unfortunately we need to define this helper twice, in task_mmu.c
  and in task_nommu.c. perhaps it makes sense to add fs/proc/util.c
  and move at least pid_of_stack/task_of_stack there to avoid the
  code duplication.

- Change show_map_vma() and show_numa_map() to use the new helper.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Ungerer <gerg@uclinux.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:50 -04:00
Oleg Nesterov
4449a51a7c vm_is_stack: use for_each_thread() rather then buggy while_each_thread()
Aleksei hit the soft lockup during reading /proc/PID/smaps.  David
investigated the problem and suggested the right fix.

while_each_thread() is racy and should die, this patch updates
vm_is_stack().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Aleksei Besogonov <alex.besogonov@gmail.com>
Tested-by: Aleksei Besogonov <alex.besogonov@gmail.com>
Suggested-by: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:17 -07:00
Andrey Ryabinin
928cec9cd6 mm: move slab related stuff from util.c to slab_common.c
Functions krealloc(), __krealloc(), kzfree() belongs to slab API, so
should be placed in slab_common.c

Also move slab allocator's tracepoints defenitions to slab_common.c No
functional changes here.

Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:15 -07:00