157 Commits

Author SHA1 Message Date
Greg Kroah-Hartman
93599f65c3 This is the 4.14.201 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl+GrkEACgkQONu9yGCS
 aT58SQ/+PLxjpiE1Mn0CRjYBclZXvuhrJOP0KOqLNIM7/eXn3pbxS9wKjn9ykTM+
 KTa5s1y0IXDaWYs4lsEnfIKKXDmLHfwnj959StIR6gW+16/cSqppKpiq14MPhkOE
 WMLvvXOUKfAGMCEzsCoof6Qu/in302DoBK6Nvec53PFeAl+yWaJV4dnIGJpZQtZF
 O2A/gVL2Fqvk2O1v6wRqWfaBPFBNePOCdMcGrTWwH8JnoSuk8VGad6AWvOTakbny
 xeRyzKhoPGXiKCiwbNU71IhXO6X5fG7Q/bnS+uZ91186FsHUEMRQeDWPWqz3OqEw
 Xa/1SSSK0bkFzLn8U0XF0Xe8Txadr/ZDc2EeRlFe0pUVO/kBrGbnT9u7erv3/Ry3
 DPPI/JeHg2onsVlnHZLAqFegA6JpGr8FiWQxgMIQ0CtklxVM123dYw8XNXS8Zr/c
 qeWKGtcpacXR+6fogtPF7HEHma59+XP2hawICgH25JOKa6MeqsaQdM5YAS2DymVV
 fhzfEj1a851KjesPM/axbQifJVjgDud2vbbv19hVMaWWDLXH/vhB+QNGeI3wAjJn
 0QuUe5kUASFy1HrleCmFQUEjOIxTKE87l2vEHzkkOnjgmWpNF/T+SR5MutCrhV8h
 9sl3QIT7zqIYci+x8oK8E2X9d2bGmGN30NfqgHo+iL47DZXKSCc=
 =tX82
 -----END PGP SIGNATURE-----

Merge 4.14.201 into android-4.14-stable

Changes in 4.14.201
	vsock/virtio: use RCU to avoid use-after-free on the_virtio_vsock
	vsock/virtio: stop workers during the .remove()
	vsock/virtio: add transport parameter to the virtio_transport_reset_no_sock()
	net: virtio_vsock: Enhance connection semantics
	USB: gadget: f_ncm: Fix NDP16 datagram validation
	gpio: tc35894: fix up tc35894 interrupt configuration
	Input: i8042 - add nopnp quirk for Acer Aspire 5 A515
	drm/amdgpu: restore proper ref count in amdgpu_display_crtc_set_config
	drivers/net/wan/hdlc_fr: Add needed_headroom for PVC devices
	drm/sun4i: mixer: Extend regmap max_register
	net: dec: de2104x: Increase receive ring size for Tulip
	rndis_host: increase sleep time in the query-response loop
	drivers/net/wan/lapbether: Make skb->protocol consistent with the header
	drivers/net/wan/hdlc: Set skb->protocol before transmitting
	mac80211: do not allow bigger VHT MPDUs than the hardware supports
	spi: fsl-espi: Only process interrupts for expected events
	nvme-fc: fail new connections to a deleted host or remote port
	pinctrl: mvebu: Fix i2c sda definition for 98DX3236
	nfs: Fix security label length not being reset
	clk: samsung: exynos4: mark 'chipid' clock as CLK_IGNORE_UNUSED
	iommu/exynos: add missing put_device() call in exynos_iommu_of_xlate()
	i2c: cpm: Fix i2c_ram structure
	Input: trackpoint - enable Synaptics trackpoints
	random32: Restore __latent_entropy attribute on net_rand_state
	net/packet: fix overflow in tpacket_rcv
	epoll: do not insert into poll queues until all sanity checks are done
	epoll: replace ->visited/visited_list with generation count
	epoll: EPOLL_CTL_ADD: close the race in decision to take fast path
	ep_create_wakeup_source(): dentry name can change under you...
	netfilter: ctnetlink: add a range check for l3/l4 protonum
	drm/syncobj: Fix drm_syncobj_handle_to_fd refcount leak
	fbdev, newport_con: Move FONT_EXTRA_WORDS macros into linux/font.h
	Fonts: Support FONT_EXTRA_WORDS macros for built-in fonts
	Revert "ravb: Fixed to be able to unload modules"
	fbcon: Fix global-out-of-bounds read in fbcon_get_font()
	net: wireless: nl80211: fix out-of-bounds access in nl80211_del_key()
	usermodehelper: reset umask to default before executing user process
	platform/x86: thinkpad_acpi: initialize tp_nvram_state variable
	platform/x86: thinkpad_acpi: re-initialize ACPI buffer size when reuse
	driver core: Fix probe_count imbalance in really_probe()
	perf top: Fix stdio interface input handling with glibc 2.28+
	mtd: rawnand: sunxi: Fix the probe error path
	Btrfs: fix unexpected failure of nocow buffered writes after snapshotting when low on space
	ftrace: Move RCU is watching check after recursion check
	macsec: avoid use-after-free in macsec_handle_frame()
	mm/khugepaged: fix filemap page_to_pgoff(page) != offset
	cifs: Fix incomplete memory allocation on setxattr path
	i2c: meson: fix clock setting overwrite
	sctp: fix sctp_auth_init_hmacs() error path
	team: set dev->needed_headroom in team_setup_by_port()
	net: team: fix memory leak in __team_options_register
	openvswitch: handle DNAT tuple collision
	drm/amdgpu: prevent double kfree ttm->sg
	xfrm: clone XFRMA_REPLAY_ESN_VAL in xfrm_do_migrate
	xfrm: clone XFRMA_SEC_CTX in xfrm_do_migrate
	xfrm: clone whole liftime_cur structure in xfrm_do_migrate
	net: stmmac: removed enabling eee in EEE set callback
	platform/x86: fix kconfig dependency warning for FUJITSU_LAPTOP
	xfrm: Use correct address family in xfrm_state_find
	bonding: set dev->needed_headroom in bond_setup_by_slave()
	mdio: fix mdio-thunder.c dependency & build error
	net: usb: ax88179_178a: fix missing stop entry in driver_info
	rxrpc: Fix rxkad token xdr encoding
	rxrpc: Downgrade the BUG() for unsupported token type in rxrpc_read()
	rxrpc: Fix some missing _bh annotations on locking conn->state_lock
	rxrpc: Fix server keyring leak
	perf: Fix task_function_call() error handling
	mmc: core: don't set limits.discard_granularity as 0
	mm: khugepaged: recalculate min_free_kbytes after memory hotplug as expected by khugepaged
	net: usb: rtl8150: set random MAC address when set_ethernet_addr() fails
	Linux 4.14.201

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Iffb5ee67b94a852de1bd865817587bc27320f28b
2020-10-14 12:08:21 +02:00
Al Viro
a3915080e9 ep_create_wakeup_source(): dentry name can change under you...
commit 3701cb59d892b88d569427586f01491552f377b1 upstream.

or get freed, for that matter, if it's a long (separately stored)
name.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-14 09:51:09 +02:00
Al Viro
69c144b13d epoll: EPOLL_CTL_ADD: close the race in decision to take fast path
commit fe0a916c1eae8e17e86c3753d13919177d63ed7e upstream.

Checking for the lack of epitems refering to the epoll we want to insert into
is not enough; we might have an insertion of that epoll into another one that
has already collected the set of files to recheck for excessive reverse paths,
but hasn't gotten to creating/inserting the epitem for it.

However, any such insertion in progress can be detected - it will update the
generation count in our epoll when it's done looking through it for files
to check.  That gets done under ->mtx of our epoll and that allows us to
detect that safely.

We are *not* holding epmutex here, so the generation count is not stable.
However, since both the update of ep->gen by loop check and (later)
insertion into ->f_ep_link are done with ep->mtx held, we are fine -
the sequence is
	grab epmutex
	bump loop_check_gen
	...
	grab tep->mtx		// 1
	tep->gen = loop_check_gen
	...
	drop tep->mtx		// 2
	...
	grab tep->mtx		// 3
	...
	insert into ->f_ep_link
	...
	drop tep->mtx		// 4
	bump loop_check_gen
	drop epmutex
and if the fastpath check in another thread happens for that
eventpoll, it can come
	* before (1) - in that case fastpath is just fine
	* after (4) - we'll see non-empty ->f_ep_link, slow path
taken
	* between (2) and (3) - loop_check_gen is stable,
with ->mtx providing barriers and we end up taking slow path.

Note that ->f_ep_link emptiness check is slightly racy - we are protected
against insertions into that list, but removals can happen right under us.
Not a problem - in the worst case we'll end up taking a slow path for
no good reason.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-14 09:51:09 +02:00
Al Viro
4aefd05cfd epoll: replace ->visited/visited_list with generation count
commit 18306c404abe18a0972587a6266830583c60c928 upstream.

removes the need to clear it, along with the races.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-14 09:51:09 +02:00
Al Viro
23fb662b13 epoll: do not insert into poll queues until all sanity checks are done
commit f8d4f44df056c5b504b0d49683fb7279218fd207 upstream.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-14 09:51:09 +02:00
Greg Kroah-Hartman
601bfacceb This is the 4.14.197 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl9ZCusACgkQONu9yGCS
 aT7qhxAA1ow01f4+8syC7RlzqjoSBWK8kp+e9DqvjLd1jCDAJedZts2SwTTL+wh4
 agJ4JzmWhZhYka4fmv03kfuCL7niOQNKfmGNIfREnQaznG75C5w5Esc36cxxk6tu
 DzwxHoGSs9KqSAa7mLxI6tkEJfxy8c0OANX9tCypH6kPhBmFVhq6bNr1fxg2DE8I
 vLTJwTHD6fUBAPUsBnxF6Mx8jOyqqU7tk94eqU3DeWQygR8ZXIaSESNRxQWe1CVm
 UWHb5BcBME8UgML9hPkSpcz6FL9qTjRZxtHW+hrzZiycW1gpsoGNC6S1EvoTDcsX
 zBjclMF2CEIGRY/MTyl9K2XNxgAhLYNtlFSf+xXm1p2IaAaN2+qLwdGLjR/M2vak
 /LGQRdfLWzO3KxoXgOPnsdTnM8tFVUSDgq4F7Dkwhvu6y4M2iaF0mBSWg1EqIbAZ
 vCfIarjbsLD74Bu1K5dMUi6ZnFqLm2jdbgwL7wH0kgKToKWqox5ds4+Gods9WKd7
 VJfY29/FZ2v/2/C9Ia5dDf0wLTF9WQIeuk+BlUu/H3XQLWchUFT/BLDX7vAqCw08
 21dO9dP0sGgdNpZSJRR11QUn4tta13XLPuDIquM4IWa24ET/4RpYdHIXFhPfmYyz
 H84Xep/Joc4+rmUZVsGRSMtWooxpZU8vhcQ23uNhsiTxMANBFUE=
 =rMMq
 -----END PGP SIGNATURE-----

Merge 4.14.197 into android-4.14-stable

Changes in 4.14.197
	HID: core: Correctly handle ReportSize being zero
	HID: core: Sanitize event code and type when mapping input
	perf record/stat: Explicitly call out event modifiers in the documentation
	drm/msm: add shutdown support for display platform_driver
	hwmon: (applesmc) check status earlier.
	nvmet: Disable keep-alive timer when kato is cleared to 0h
	ceph: don't allow setlease on cephfs
	cpuidle: Fixup IRQ state
	s390: don't trace preemption in percpu macros
	xen/xenbus: Fix granting of vmalloc'd memory
	dmaengine: of-dma: Fix of_dma_router_xlate's of_dma_xlate handling
	batman-adv: Avoid uninitialized chaddr when handling DHCP
	batman-adv: Fix own OGM check in aggregated OGMs
	batman-adv: bla: use netif_rx_ni when not in interrupt context
	dmaengine: at_hdmac: check return value of of_find_device_by_node() in at_dma_xlate()
	MIPS: mm: BMIPS5000 has inclusive physical caches
	MIPS: BMIPS: Also call bmips_cpu_setup() for secondary cores
	netfilter: nf_tables: add NFTA_SET_USERDATA if not null
	netfilter: nf_tables: incorrect enum nft_list_attributes definition
	netfilter: nf_tables: fix destination register zeroing
	net: hns: Fix memleak in hns_nic_dev_probe
	net: systemport: Fix memleak in bcm_sysport_probe
	ravb: Fixed to be able to unload modules
	net: arc_emac: Fix memleak in arc_mdio_probe
	dmaengine: pl330: Fix burst length if burst size is smaller than bus width
	gtp: add GTPA_LINK info to msg sent to userspace
	bnxt_en: Check for zero dir entries in NVRAM.
	bnxt_en: Fix PCI AER error recovery flow
	nvmet-fc: Fix a missed _irqsave version of spin_lock in 'nvmet_fc_fod_op_done()'
	perf tools: Correct SNOOPX field offset
	net: ethernet: mlx4: Fix memory allocation in mlx4_buddy_init()
	fix regression in "epoll: Keep a reference on files added to the check list"
	tg3: Fix soft lockup when tg3_reset_task() fails.
	iommu/vt-d: Serialize IOMMU GCMD register modifications
	thermal: ti-soc-thermal: Fix bogus thermal shutdowns for omap4430
	include/linux/log2.h: add missing () around n in roundup_pow_of_two()
	btrfs: drop path before adding new uuid tree entry
	btrfs: Remove redundant extent_buffer_get in get_old_root
	btrfs: Remove extraneous extent_buffer_get from tree_mod_log_rewind
	btrfs: set the lockdep class for log tree extent buffers
	uaccess: Add non-pagefault user-space read functions
	uaccess: Add non-pagefault user-space write function
	btrfs: fix potential deadlock in the search ioctl
	net: usb: qmi_wwan: add Telit 0x1050 composition
	usb: qmi_wwan: add D-Link DWM-222 A2 device ID
	ALSA: ca0106: fix error code handling
	ALSA: pcm: oss: Remove superfluous WARN_ON() for mulaw sanity check
	ALSA: hda/hdmi: always check pin power status in i915 pin fixup
	ALSA: firewire-digi00x: exclude Avid Adrenaline from detection
	affs: fix basic permission bits to actually work
	block: allow for_each_bvec to support zero len bvec
	block: Move SECTOR_SIZE and SECTOR_SHIFT definitions into <linux/blkdev.h>
	libata: implement ATA_HORKAGE_MAX_TRIM_128M and apply to Sandisks
	dm cache metadata: Avoid returning cmd->bm wild pointer on error
	dm thin metadata: Avoid returning cmd->bm wild pointer on error
	mm: slub: fix conversion of freelist_corrupted()
	KVM: arm64: Add kvm_extable for vaxorcism code
	KVM: arm64: Defer guest entry when an asynchronous exception is pending
	KVM: arm64: Survive synchronous exceptions caused by AT instructions
	KVM: arm64: Set HCR_EL2.PTW to prevent AT taking synchronous exception
	checkpatch: fix the usage of capture group ( ... )
	mm/hugetlb: fix a race between hugetlb sysctl handlers
	cfg80211: regulatory: reject invalid hints
	net: usb: Fix uninit-was-stored issue in asix_read_phy_addr()
	Linux 4.14.197

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I59da148f114d7f8fd84fc76c8178081ddaf26a49
2020-09-09 19:35:54 +02:00
Al Viro
c5c6e00f6c fix regression in "epoll: Keep a reference on files added to the check list"
[ Upstream commit 77f4689de17c0887775bb77896f4cc11a39bf848 ]

epoll_loop_check_proc() can run into a file already committed to destruction;
we can't grab a reference on those and don't need to add them to the set for
reverse path check anyway.

Tested-by: Marc Zyngier <maz@kernel.org>
Fixes: a9ed4a6560b8 ("epoll: Keep a reference on files added to the check list")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-09-09 19:03:09 +02:00
Greg Kroah-Hartman
a929037d0e This is the 4.14.195 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAl9GHZYACgkQONu9yGCS
 aT4yJQ//SYMo7u1etFoWpZw7QuWb7G4gMcSm4LEu5eBpJDg1B2QzTlN7zG1UGu4G
 d5hFHAyUg8xDenIUo7oqqR4SjfIHyQhukiTabM6DZOfXSkW9cbHJYVXe77SR2+87
 KT79vND4vvt1Cqpmw27ZAevBUk7Myq8bw5sB5Y+1F1Dg+Z+Ya19B6ds+w9EkWqh/
 hEJxk2Xdn2JaNED7go32mspfQAfG1FlZ+LzXn1SdZ1vMd3NJWucRDGl1QopQ1NBT
 W/9fol2r5CKvpmGGXRf3+6qymcaa1mxwMqs3fR5UrVkBcl4gnJzexyD08osBL/lJ
 54TuJ2i4QbNwHP00YlVxjpH7sHOUiKXUA6ZXz3Br+eSACueWMMGQOyy3YT2hPM3y
 FkQrrljsoj8iRP3IT+kB8OdfaVMeYxn6XjlEahXOr5vgVzRxmqHSEk9aWLP7sLbX
 jxha0xh/sfW+PtLRa9OKlGU+AgutQSFC5YxCRglo12mMiZU6b+qddR9NYCQRYabz
 idHVaZl8ByfaBTODA4HZvDnikDDOp/xqtxlDBbW5rGVMu2a6iksxda3H7Wpyvb1x
 5y+UCNDlzNiehcFqL4lctx+symuVRZB9dfh/vq6D/Dv5aJ6jUPdjPZ9xKtJJF2ww
 sgaWXhQFUA6TCIFFbfpN0gbr8CYxBUaOXNIfyj+qKAm1l9k4rZI=
 =cZzK
 -----END PGP SIGNATURE-----

Merge 4.14.195 into android-4.14-stable

Changes in 4.14.195
	drm/vgem: Replace opencoded version of drm_gem_dumb_map_offset()
	perf probe: Fix memory leakage when the probe point is not found
	khugepaged: khugepaged_test_exit() check mmget_still_valid()
	khugepaged: adjust VM_BUG_ON_MM() in __khugepaged_enter()
	powerpc/mm: Only read faulting instruction when necessary in do_page_fault()
	powerpc: Allow 4224 bytes of stack expansion for the signal frame
	btrfs: export helpers for subvolume name/id resolution
	btrfs: don't show full path of bind mounts in subvol=
	btrfs: Move free_pages_out label in inline extent handling branch in compress_file_range
	btrfs: inode: fix NULL pointer dereference if inode doesn't need compression
	btrfs: sysfs: use NOFS for device creation
	romfs: fix uninitialized memory leak in romfs_dev_read()
	kernel/relay.c: fix memleak on destroy relay channel
	mm: include CMA pages in lowmem_reserve at boot
	mm, page_alloc: fix core hung in free_pcppages_bulk()
	ext4: fix checking of directory entry validity for inline directories
	jbd2: add the missing unlock_buffer() in the error path of jbd2_write_superblock()
	spi: Prevent adding devices below an unregistering controller
	scsi: ufs: Add DELAY_BEFORE_LPM quirk for Micron devices
	media: budget-core: Improve exception handling in budget_register()
	rtc: goldfish: Enable interrupt in set_alarm() when necessary
	media: vpss: clean up resources in init
	Input: psmouse - add a newline when printing 'proto' by sysfs
	m68knommu: fix overwriting of bits in ColdFire V3 cache control
	xfs: fix inode quota reservation checks
	jffs2: fix UAF problem
	cpufreq: intel_pstate: Fix cpuinfo_max_freq when MSR_TURBO_RATIO_LIMIT is 0
	scsi: libfc: Free skb in fc_disc_gpn_id_resp() for valid cases
	virtio_ring: Avoid loop when vq is broken in virtqueue_poll
	xfs: Fix UBSAN null-ptr-deref in xfs_sysfs_init
	alpha: fix annotation of io{read,write}{16,32}be()
	ext4: fix potential negative array index in do_split()
	i40e: Set RX_ONLY mode for unicast promiscuous on VLAN
	i40e: Fix crash during removing i40e driver
	net: fec: correct the error path for regulator disable in probe
	bonding: show saner speed for broadcast mode
	bonding: fix a potential double-unregister
	ASoC: msm8916-wcd-analog: fix register Interrupt offset
	ASoC: intel: Fix memleak in sst_media_open
	vfio/type1: Add proper error unwind for vfio_iommu_replay()
	bonding: fix active-backup failover for current ARP slave
	hv_netvsc: Fix the queue_mapping in netvsc_vf_xmit()
	net: dsa: b53: check for timeout
	powerpc/pseries: Do not initiate shutdown when system is running on UPS
	epoll: Keep a reference on files added to the check list
	do_epoll_ctl(): clean the failure exits up a bit
	mm/hugetlb: fix calculation of adjust_range_if_pmd_sharing_possible
	xen: don't reschedule in preemption off sections
	clk: Evict unregistered clks from parent caches
	KVM: arm/arm64: Don't reschedule in unmap_stage2_range()
	Linux 4.14.195

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I6c25044ba9166ec01723671d9cfa3fdf08ccc43f
2020-08-26 11:01:28 +02:00
Al Viro
6875d79ba7 do_epoll_ctl(): clean the failure exits up a bit
commit 52c479697c9b73f628140dcdfcd39ea302d05482 upstream.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-08-26 10:29:59 +02:00
Marc Zyngier
7867516bcb epoll: Keep a reference on files added to the check list
commit a9ed4a6560b8562b7e2e2bed9527e88001f7b682 upstream.

When adding a new fd to an epoll, and that this new fd is an
epoll fd itself, we recursively scan the fds attached to it
to detect cycles, and add non-epool files to a "check list"
that gets subsequently parsed.

However, this check list isn't completely safe when deletions
can happen concurrently. To sidestep the issue, make sure that
a struct file placed on the check list sees its f_count increased,
ensuring that a concurrent deletion won't result in the file
disapearing from under our feet.

Cc: stable@vger.kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-08-26 10:29:59 +02:00
Tri Vo
90001697d1 BACKPORT: PM / wakeup: Show wakeup sources stats in sysfs
Add an ID and a device pointer to 'struct wakeup_source'. Use them to to
expose wakeup sources statistics in sysfs under
/sys/class/wakeup/wakeup<ID>/*.

Co-developed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Co-developed-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Tri Vo <trong@android.com>
Tested-by: Kalesh Singh <kaleshsingh@google.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit c8377adfa78103be5380200eb9dab764d7ca890e)
[ Replaced ida_alloc()/ida_free( with) ida_simple_get()/ida_simple_remove() as
  the former is not present in 4.14. ]
Bug: 129087298
Signed-off-by: Tri Vo <trong@google.com>
Change-Id: Iecd3412423f9d499981f44d3b69507eaa62a2cd9
2019-10-11 16:48:59 -07:00
Greg Kroah-Hartman
0951849351 This is the 4.14.99 stable release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAlxjFHYACgkQONu9yGCS
 aT5H0w/8D6sNHqN7RXJogmqrdCg6xzfwxdQs+56WWjOUrEjfzpT/x6c8+f/s8nuA
 lvTO12FKspWxLnwYNx+hcLUnIzs1LX6XxH/ls1mHWVqQ11We/yrUhnpEFhcCLITn
 n8fSx6OOI/MeblsvW6qkz+uIf7EwArvCdv956nSeF6UyJVXxe67kyJKZZYCsuvpL
 GvBmMpr+REhKu170bHP7cS783iLN6HavEcuzxnjxh7T6+InTa1zJQQATL3NR/PWZ
 Cy1RAwS3pSeGA6UqeXbslTPwuQ5MCCXucP3cGvAMRURxNEbUL44dBS1G3+HDurn6
 OqXxO4u0mBhRabm4cFGXrFXqVJOCn0O2E1Otjat7S9To/bRINZsVkj+XjqLHyHuN
 oPmsQUiRZ3iBOphr69WuxzjQRtHzkalJmo/poAdPhvA4OFqKB8n3Q4GLU1nqp4qx
 EXJ7pQo2AanClkP2j7lIQtdM7n8lPyy80hnR484ysP0pZaH2ErGMrK60rzZLDDlY
 Zk7i8uZ/jTrHsIXgWYL2F5Lz4St6MY49ePQaFR4Wjy7p76ZoPZMkYmxvA5ZrmOpT
 UyggwNwT6kh0/yzXTZZ5O1N+7IUeam9Br+2UHxmpYXMs/P6xjW0YAczDZ9crqiV8
 zys1u1gR+DKL/bw7JMEIMDsUZlhhzxmW9Eidfl7QlUpiYJNpM00=
 =ShQi
 -----END PGP SIGNATURE-----

Merge 4.14.99 into android-4.14

Changes in 4.14.99
	drm/bufs: Fix Spectre v1 vulnerability
	staging: iio: adc: ad7280a: handle error from __ad7280_read32()
	drm/vgem: Fix vgem_init to get drm device available.
	pinctrl: bcm2835: Use raw spinlock for RT compatibility
	ASoC: Intel: mrfld: fix uninitialized variable access
	gpu: ipu-v3: image-convert: Prevent race between run and unprepare
	ath9k: dynack: use authentication messages for 'late' ack
	scsi: lpfc: Correct LCB RJT handling
	scsi: mpt3sas: Call sas_remove_host before removing the target devices
	scsi: lpfc: Fix LOGO/PLOGI handling when triggerd by ABTS Timeout event
	ARM: 8808/1: kexec:offline panic_smp_self_stop CPU
	clk: boston: fix possible memory leak in clk_boston_setup()
	dlm: Don't swamp the CPU with callbacks queued during recovery
	x86/PCI: Fix Broadcom CNB20LE unintended sign extension (redux)
	powerpc/pseries: add of_node_put() in dlpar_detach_node()
	crypto: aes_ti - disable interrupts while accessing S-box
	drm/vc4: ->x_scaling[1] should never be set to VC4_SCALING_NONE
	serial: fsl_lpuart: clear parity enable bit when disable parity
	ptp: check gettime64 return code in PTP_SYS_OFFSET ioctl
	MIPS: Boston: Disable EG20T prefetch
	staging:iio:ad2s90: Make probe handle spi_setup failure
	fpga: altera-cvp: Fix registration for CvP incapable devices
	Tools: hv: kvp: Fix a warning of buffer overflow with gcc 8.0.1
	platform/chrome: don't report EC_MKBP_EVENT_SENSOR_FIFO as wakeup
	staging: iio: ad7780: update voltage on read
	usbnet: smsc95xx: fix rx packet alignment
	drm/rockchip: fix for mailbox read size
	ARM: OMAP2+: hwmod: Fix some section annotations
	net/mlx5: EQ, Use the right place to store/read IRQ affinity hint
	modpost: validate symbol names also in find_elf_symbol
	perf tools: Add Hygon Dhyana support
	soc/tegra: Don't leak device tree node reference
	media: mtk-vcodec: Release device nodes in mtk_vcodec_init_enc_pm()
	ptp: Fix pass zero to ERR_PTR() in ptp_clock_register
	dmaengine: xilinx_dma: Remove __aligned attribute on zynqmp_dma_desc_ll
	iio: adc: meson-saradc: check for devm_kasprintf failure
	iio: adc: meson-saradc: fix internal clock names
	iio: accel: kxcjk1013: Add KIOX010A ACPI Hardware-ID
	media: adv*/tc358743/ths8200: fill in min width/height/pixelclock
	ACPI: SPCR: Consider baud rate 0 as preconfigured state
	staging: pi433: fix potential null dereference
	f2fs: move dir data flush to write checkpoint process
	f2fs: fix race between write_checkpoint and write_begin
	f2fs: fix wrong return value of f2fs_acl_create
	i2c: sh_mobile: add support for r8a77990 (R-Car E3)
	arm64: io: Ensure calls to delay routines are ordered against prior readX()
	sunvdc: Do not spin in an infinite loop when vio_ldc_send() returns EAGAIN
	soc: bcm: brcmstb: Don't leak device tree node reference
	nfsd4: fix crash on writing v4_end_grace before nfsd startup
	drm: Clear state->acquire_ctx before leaving drm_atomic_helper_commit_duplicated_state()
	arm64: io: Ensure value passed to __iormb() is held in a 64-bit register
	Thermal: do not clear passive state during system sleep
	firmware/efi: Add NULL pointer checks in efivars API functions
	s390/zcrypt: improve special ap message cmd handling
	arm64: ftrace: don't adjust the LR value
	ARM: dts: mmp2: fix TWSI2
	x86/fpu: Add might_fault() to user_insn()
	media: DaVinci-VPBE: fix error handling in vpbe_initialize()
	smack: fix access permissions for keyring
	usb: dwc3: Correct the logic for checking TRB full in __dwc3_prepare_one_trb()
	usb: hub: delay hub autosuspend if USB3 port is still link training
	timekeeping: Use proper seqcount initializer
	usb: mtu3: fix the issue about SetFeature(U1/U2_Enable)
	clk: sunxi-ng: a33: Set CLK_SET_RATE_PARENT for all audio module clocks
	driver core: Move async_synchronize_full call
	kobject: return error code if writing /sys/.../uevent fails
	IB/hfi1: Unreserve a reserved request when it is completed
	usb: dwc3: trace: add missing break statement to make compiler happy
	pinctrl: sx150x: handle failure case of devm_kstrdup
	iommu/amd: Fix amd_iommu=force_isolation
	ARM: dts: Fix OMAP4430 SDP Ethernet startup
	mips: bpf: fix encoding bug for mm_srlv32_op
	media: coda: fix H.264 deblocking filter controls
	ARM: dts: Fix up the D-Link DIR-685 MTD partition info
	watchdog: renesas_wdt: don't set divider while watchdog is running
	usb: dwc3: gadget: Disable CSP for stream OUT ep
	iommu/arm-smmu: Add support for qcom,smmu-v2 variant
	iommu/arm-smmu-v3: Use explicit mb() when moving cons pointer
	sata_rcar: fix deferred probing
	clk: imx6sl: ensure MMDC CH0 handshake is bypassed
	cpuidle: big.LITTLE: fix refcount leak
	OPP: Use opp_table->regulators to verify no regulator case
	i2c-axxia: check for error conditions first
	phy: sun4i-usb: add support for missing USB PHY index
	udf: Fix BUG on corrupted inode
	switchtec: Fix SWITCHTEC_IOCTL_EVENT_IDX_ALL flags overwrite
	selftests/bpf: use __bpf_constant_htons in test_prog.c
	ARM: pxa: avoid section mismatch warning
	ASoC: fsl: Fix SND_SOC_EUKREA_TLV320 build error on i.MX8M
	KVM: PPC: Book3S: Only report KVM_CAP_SPAPR_TCE_VFIO on powernv machines
	mmc: bcm2835: Recover from MMC_SEND_EXT_CSD
	mmc: bcm2835: reset host on timeout
	memstick: Prevent memstick host from getting runtime suspended during card detection
	mmc: sdhci-of-esdhc: Fix timeout checks
	mmc: sdhci-xenon: Fix timeout checks
	tty: serial: samsung: Properly set flags in autoCTS mode
	perf test: Fix perf_event_attr test failure
	perf header: Fix unchecked usage of strncpy()
	perf probe: Fix unchecked usage of strncpy()
	arm64: KVM: Skip MMIO insn after emulation
	usb: musb: dsps: fix otg state machine
	percpu: convert spin_lock_irq to spin_lock_irqsave.
	powerpc/uaccess: fix warning/error with access_ok()
	mac80211: fix radiotap vendor presence bitmap handling
	xfrm6_tunnel: Fix spi check in __xfrm6_tunnel_alloc_spi
	Bluetooth: Fix unnecessary error message for HCI request completion
	mlxsw: spectrum: Properly cleanup LAG uppers when removing port from LAG
	scsi: smartpqi: correct host serial num for ssa
	scsi: smartpqi: correct volume status
	scsi: smartpqi: increase fw status register read timeout
	cw1200: Fix concurrency use-after-free bugs in cw1200_hw_scan()
	powerpc/perf: Fix thresholding counter data for unknown type
	drbd: narrow rcu_read_lock in drbd_sync_handshake
	drbd: disconnect, if the wrong UUIDs are attached on a connected peer
	drbd: skip spurious timeout (ping-timeo) when failing promote
	drbd: Avoid Clang warning about pointless switch statment
	video: clps711x-fb: release disp device node in probe()
	md: fix raid10 hang issue caused by barrier
	fbdev: fbmem: behave better with small rotated displays and many CPUs
	i40e: define proper net_device::neigh_priv_len
	igb: Fix an issue that PME is not enabled during runtime suspend
	ACPI/APEI: Clear GHES block_status before panic()
	fbdev: fbcon: Fix unregister crash when more than one framebuffer
	powerpc/mm: Fix reporting of kernel execute faults on the 8xx
	pinctrl: meson: meson8: fix the GPIO function for the GPIOAO pins
	pinctrl: meson: meson8b: fix the GPIO function for the GPIOAO pins
	KVM: x86: svm: report MSR_IA32_MCG_EXT_CTL as unsupported
	powerpc/fadump: Do not allow hot-remove memory from fadump reserved area.
	kvm: Change offset in kvm_write_guest_offset_cached to unsigned
	NFS: nfs_compare_mount_options always compare auth flavors.
	hwmon: (lm80) fix a missing check of the status of SMBus read
	hwmon: (lm80) fix a missing check of bus read in lm80 probe
	seq_buf: Make seq_buf_puts() null-terminate the buffer
	crypto: ux500 - Use proper enum in cryp_set_dma_transfer
	crypto: ux500 - Use proper enum in hash_set_dma_transfer
	MIPS: ralink: Select CONFIG_CPU_MIPSR2_IRQ_VI on MT7620/8
	cifs: check ntwrk_buf_start for NULL before dereferencing it
	um: Avoid marking pages with "changed protection"
	niu: fix missing checks of niu_pci_eeprom_read
	f2fs: fix sbi->extent_list corruption issue
	cgroup: fix parsing empty mount option string
	scripts/decode_stacktrace: only strip base path when a prefix of the path
	ocfs2: don't clear bh uptodate for block read
	ocfs2: improve ocfs2 Makefile
	isdn: hisax: hfc_pci: Fix a possible concurrency use-after-free bug in HFCPCI_l1hw()
	gdrom: fix a memory leak bug
	fsl/fman: Use GFP_ATOMIC in {memac,tgec}_add_hash_mac_address()
	block/swim3: Fix -EBUSY error when re-opening device after unmount
	thermal: bcm2835: enable hwmon explicitly
	kdb: Don't back trace on a cpu that didn't round up
	thermal: generic-adc: Fix adc to temp interpolation
	HID: lenovo: Add checks to fix of_led_classdev_register
	kernel/hung_task.c: break RCU locks based on jiffies
	proc/sysctl: fix return error for proc_doulongvec_minmax()
	kernel/hung_task.c: force console verbose before panic
	fs/epoll: drop ovflist branch prediction
	exec: load_script: don't blindly truncate shebang string
	scripts/gdb: fix lx-version string output
	thermal: hwmon: inline helpers when CONFIG_THERMAL_HWMON is not set
	dccp: fool proof ccid_hc_[rt]x_parse_options()
	enic: fix checksum validation for IPv6
	net: dp83640: expire old TX-skb
	rxrpc: bad unlock balance in rxrpc_recvmsg
	skge: potential memory corruption in skge_get_regs()
	rds: fix refcount bug in rds_sock_addref
	net: systemport: Fix WoL with password after deep sleep
	net/mlx5e: Force CHECKSUM_UNNECESSARY for short ethernet frames
	net: dsa: slave: Don't propagate flag changes on down slave interfaces
	ALSA: compress: Fix stop handling on compressed capture streams
	ALSA: hda - Serialize codec registrations
	fuse: call pipe_buf_release() under pipe lock
	fuse: decrement NR_WRITEBACK_TEMP on the right page
	fuse: handle zero sized retrieve correctly
	dmaengine: bcm2835: Fix interrupt race on RT
	dmaengine: bcm2835: Fix abort of transactions
	dmaengine: imx-dma: fix wrong callback invoke
	futex: Handle early deadlock return correctly
	irqchip/gic-v3-its: Plug allocation race for devices sharing a DevID
	usb: phy: am335x: fix race condition in _probe
	usb: dwc3: gadget: Handle 0 xfer length for OUT EP
	usb: gadget: udc: net2272: Fix bitwise and boolean operations
	usb: gadget: musb: fix short isoc packets with inventra dma
	staging: speakup: fix tty-operation NULL derefs
	scsi: cxlflash: Prevent deadlock when adapter probe fails
	scsi: aic94xx: fix module loading
	KVM: x86: work around leak of uninitialized stack contents (CVE-2019-7222)
	kvm: fix kvm_ioctl_create_device() reference counting (CVE-2019-6974)
	KVM: nVMX: unconditionally cancel preemption timer in free_nested (CVE-2019-7221)
	cpu/hotplug: Fix "SMT disabled by BIOS" detection for KVM
	perf/x86/intel/uncore: Add Node ID mask
	x86/MCE: Initialize mce.bank in the case of a fatal error in mce_no_way_out()
	perf/core: Don't WARN() for impossible ring-buffer sizes
	perf tests evsel-tp-sched: Fix bitwise operator
	serial: fix race between flush_to_ldisc and tty_open
	serial: 8250_pci: Make PCI class test non fatal
	nfsd4: fix cached replies to solo SEQUENCE compounds
	nfsd4: catch some false session retries
	IB/hfi1: Add limit test for RC/UC send via loopback
	perf/x86/intel: Delay memory deallocation until x86_pmu_dead_cpu()
	ath9k: dynack: make ewma estimation faster
	ath9k: dynack: check da->enabled first in sampling routines
	Linux 4.14.99

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-02-12 20:21:21 +01:00
Davidlohr Bueso
fad3ec7ce4 fs/epoll: drop ovflist branch prediction
[ Upstream commit 76699a67f3041ff4c7af6d6ee9be2bfbf1ffb671 ]

The ep->ovflist is a secondary ready-list to temporarily store events
that might occur when doing sproc without holding the ep->wq.lock.  This
accounts for every time we check for ready events and also send events
back to userspace; both callbacks, particularly the latter because of
copy_to_user, can account for a non-trivial time.

As such, the unlikely() check to see if the pointer is being used, seems
both misleading and sub-optimal.  In fact, we go to an awful lot of
trouble to sync both lists, and populating the ovflist is far from an
uncommon scenario.

For example, profiling a concurrent epoll_wait(2) benchmark, with
CONFIG_PROFILE_ANNOTATED_BRANCHES shows that for a two threads a 33%
incorrect rate was seen; and when incrementally increasing the number of
epoll instances (which is used, for example for multiple queuing load
balancing models), up to a 90% incorrect rate was seen.

Similarly, by deleting the prediction, 3% throughput boost was seen
across incremental threads.

Link: http://lkml.kernel.org/r/20181108051006.18751-4-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jason Baron <jbaron@akamai.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-02-12 19:46:10 +01:00
Colin Cross
eada3e0077 ANDROID: fs: epoll: use freezable blocking call
Avoid waking up every thread sleeping in an epoll_wait call during
suspend and resume by calling a freezable blocking call.  Previous
patches modified the freezer to avoid sending wakeups to threads
that are blocked in freezable blocking calls.

This call was selected to be converted to a freezable call because
it doesn't hold any locks or release any resources when interrupted
that might be needed by another freezing task or a kernel driver
during suspend, and is a common site where idle userspace tasks are
blocked.

Change-Id: I848d08d28c89302fd42bbbdfa76489a474ab27bf
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2017-12-18 21:11:22 +05:30
Davidlohr Bueso
b2ac2ea629 fs/epoll: use faster rb_first_cached()
...  such that we can avoid the tree walks to get the node with the
smallest key.  Semantically the same, as the previously used rb_first(),
but O(1).  The main overhead is the extra footprint for the cached rb_node
pointer, which should not matter for epoll.

Link: http://lkml.kernel.org/r/20170719014603.19029-15-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08 18:26:49 -07:00
Oleg Nesterov
138e4ad67a epoll: fix race between ep_poll_callback(POLLFREE) and ep_free()/ep_remove()
The race was introduced by me in commit 971316f0503a ("epoll:
ep_unregister_pollwait() can use the freed pwq->whead").  I did not
realize that nothing can protect eventpoll after ep_poll_callback() sets
->whead = NULL, only whead->lock can save us from the race with
ep_free() or ep_remove().

Move ->whead = NULL to the end of ep_poll_callback() and add the
necessary barriers.

TODO: cleanup the ewake/EPOLLEXCLUSIVE logic, it was confusing even
before this patch.

Hopefully this explains use-after-free reported by syzcaller:

	BUG: KASAN: use-after-free in debug_spin_lock_before
	...
	 _raw_spin_lock_irqsave+0x4a/0x60 kernel/locking/spinlock.c:159
	 ep_poll_callback+0x29f/0xff0 fs/eventpoll.c:1148

this is spin_lock(eventpoll->lock),

	...
	Freed by task 17774:
	...
	 kfree+0xe8/0x2c0 mm/slub.c:3883
	 ep_free+0x22c/0x2a0 fs/eventpoll.c:865

Fixes: 971316f0503a ("epoll: ep_unregister_pollwait() can use the freed pwq->whead")
Reported-by: 范龙飞 <long7573@126.com>
Cc: stable@vger.kernel.org
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-01 13:07:35 -07:00
Cyrill Gorcunov
92ef6da3d0 kcmp: fs/epoll: wrap kcmp code with CONFIG_CHECKPOINT_RESTORE
kcmp syscall is build iif CONFIG_CHECKPOINT_RESTORE is selected, so wrap
appropriate helpers in epoll code with the config to build it
conditionally.

Link: http://lkml.kernel.org/r/20170513083456.GG1881@uranus.lan
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Reported-by: Andrew Morton <akpm@linuxfoundation.org>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:01 -07:00
Cyrill Gorcunov
0791e3644e kcmp: add KCMP_EPOLL_TFD mode to compare epoll target files
With current epoll architecture target files are addressed with
file_struct and file descriptor number, where the last is not unique.
Moreover files can be transferred from another process via unix socket,
added into queue and closed then so we won't find this descriptor in the
task fdinfo list.

Thus to checkpoint and restore such processes CRIU needs to find out
where exactly the target file is present to add it into epoll queue.
For this sake one can use kcmp call where some particular target file
from the queue is compared with arbitrary file passed as an argument.

Because epoll target files can have same file descriptor number but
different file_struct a caller should explicitly specify the offset
within.

To test if some particular file is matching entry inside epoll one have
to

 - fill kcmp_epoll_slot structure with epoll file descriptor,
   target file number and target file offset (in case if only
   one target is present then it should be 0)

 - call kcmp as kcmp(pid1, pid2, KCMP_EPOLL_TFD, fd, &kcmp_epoll_slot)
    - the kernel fetch file pointer matching file descriptor @fd of pid1
    - lookups for file struct in epoll queue of pid2 and returns traditional
      0,1,2 result for sorting purpose

Link: http://lkml.kernel.org/r/20170424154423.511592110@gmail.com
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Andrey Vagin <avagin@openvz.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:01 -07:00
Cyrill Gorcunov
77493f04b7 procfs: fdinfo: extend information about epoll target files
Since it is possbile to have same number in tfd field (say file added,
closed, then nother file dup'ed to same number and added back) it is
imposible to distinguish such target files solely by their numbers.

Strictly speaking regular applications don't need to recognize these
targets at all but for checkpoint/restore sake we need to collect
targets to be able to push them back on restore stage in a proper order.

Thus lets add file position, inode and device number where this target
lays.  This three fields can be used as a primary key for sorting, and
together with kcmp help CRIU can find out an exact file target (from the
whole set of processes being checkpointed).

Link: http://lkml.kernel.org/r/20170424154423.436491881@gmail.com
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:01 -07:00
David Rientjes
c257a340ed fs, epoll: short circuit fetching events if thread has been killed
We've encountered zombies that are waiting for a thread to exit that are
looping in ep_poll() almost endlessly although there is a pending
SIGKILL as a result of a group exit.

This happens because we always find ep_events_available() and fetch more
events and never are able to check for signal_pending() that would break
from the loop and return -EINTR.

Special case fatal signals and break immediately to guarantee that we
loop to fetch more events and delay making a timely exit.

It would also be possible to simply move the check for signal_pending()
higher than checking for ep_events_available(), but there have been no
reports of delayed signal handling other than SIGKILL preventing zombies
from exiting that would be fixed by this.

It fixes an issue for us where we have witnessed zombies sticking around
for at least O(minutes), but considering the code has been like this
forever and nobody else has complained that I have found, I would simply
queue it up for 4.12.

Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705031722350.76784@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jan Kara <jack@suse.cz>
Cc: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:36 -07:00
Ingo Molnar
2055da9738 sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
So I've noticed a number of instances where it was not obvious from the
code whether ->task_list was for a wait-queue head or a wait-queue entry.

Furthermore, there's a number of wait-queue users where the lists are
not for 'tasks' but other entities (poll tables, etc.), in which case
the 'task_list' name is actively confusing.

To clear this all up, name the wait-queue head and entry list structure
fields unambiguously:

	struct wait_queue_head::task_list	=> ::head
	struct wait_queue_entry::task_list	=> ::entry

For example, this code:

	rqw->wait.task_list.next != &wait->task_list

... is was pretty unclear (to me) what it's doing, while now it's written this way:

	rqw->wait.head.next != &wait->entry

... which makes it pretty clear that we are iterating a list until we see the head.

Other examples are:

	list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
	list_for_each_entry(wq, &fence->wait.task_list, task_list) {

... where it's unclear (to me) what we are iterating, and during review it's
hard to tell whether it's trying to walk a wait-queue entry (which would be
a bug), while now it's written as:

	list_for_each_entry_safe(pos, next, &x->head, entry) {
	list_for_each_entry(wq, &fence->wait.head, entry) {

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 12:19:14 +02:00
Ingo Molnar
ac6424b981 sched/wait: Rename wait_queue_t => wait_queue_entry_t
Rename:

	wait_queue_t		=>	wait_queue_entry_t

'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
which had to carry the name.

Start sorting this out by renaming it to 'wait_queue_entry_t'.

This also allows the real structure name 'struct __wait_queue' to
lose its double underscore and become 'struct wait_queue_entry',
which is the more canonical nomenclature for such data types.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-20 12:18:27 +02:00
Sridhar Samudrala
bf3b9f6372 epoll: Add busy poll support to epoll with socket fds.
This patch adds busy poll support to epoll. The implementation is meant to
be opportunistic in that it will take the NAPI ID from the last socket
that is added to the ready list that contains a valid NAPI ID and it will
use that for busy polling until the ready list goes empty.  Once the ready
list goes empty the NAPI ID is reset and busy polling is disabled until a
new socket is added to the ready list.

In addition when we insert a new socket into the epoll we record the NAPI
ID and assume we are going to receive events on it.  If that doesn't occur
it will be evicted as the active NAPI ID and we will resume normal
behavior.

An application can use SO_INCOMING_CPU or SO_REUSEPORT_ATTACH_C/EBPF socket
options to spread the incoming connections to specific worker threads
based on the incoming queue. This enables epoll for each worker thread
to have only sockets that receive packets from a single queue. So when an
application calls epoll_wait() and there are no events available to report,
busy polling is done on the associated queue to pull the packets.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-24 20:49:31 -07:00
Ingo Molnar
174cd4b1e5 sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h>
Fix up affected files that include this signal functionality via sched.h.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:32 +01:00
Cyrill Gorcunov
c857ab640c fs,eventpoll: don't test for bitfield with stack value
In case if epoll_ctl is called with operation EPOLL_CTL_DEL then
@epds.events variable allocated on stack may contain random bits which
we test then for EPOLLEXCLUSIVE.  Since currently the test look like

	if (epds.events & EPOLLEXCLUSIVE) {
		if (op == EPOLL_CTL_MOD)
			goto error_tgt_fput;
		if (op == EPOLL_CTL_ADD && (is_file_epoll(tf.file) ||
				(epds.events & ~EPOLLEXCLUSIVE_OK_BITS)))
			goto error_tgt_fput;
	}

Nothing serious will happen even if epds.events has this bit set, still
better to be on safe side and make sure that we're to test this bit at
all.

Link: http://lkml.kernel.org/r/20170214154935.GG1850@uranus.lan
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Vagin <avagin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-27 18:43:45 -08:00
Linus Torvalds
7c0f6ba682 Replace <asm/uaccess.h> with <linux/uaccess.h> globally
This was entirely automated, using the script by Al:

  PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
  sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
        $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)

to do the replacement at the end of the merge window.

Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-24 11:46:01 -08:00
Deepa Dinamani
766b9f928b fs: poll/select/recvmmsg: use timespec64 for timeout events
struct timespec is not y2038 safe.  Even though timespec might be
sufficient to represent timeouts, use struct timespec64 here as the plan
is to get rid of all timespec reference in the kernel.

The patch transitions the common functions: poll_select_set_timeout()
and select_estimate_accuracy() to use timespec64.  And, all the syscalls
that use these functions are transitioned in the same patch.

The restart block parameters for poll uses monotonic time.  Use
timespec64 here as well to assign timeout value.  This parameter in the
restart block need not change because this only holds the monotonic
timestamp at which timeout should occur.  And, unsigned long data type
should be big enough for this timestamp.

The system call interfaces will be handled in a separate series.

Compat interfaces need not change as timespec64 is an alias to struct
timespec on a 64 bit system.

Link: http://lkml.kernel.org/r/1461947989-21926-3-git-send-email-deepa.kernel@gmail.com
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: John Stultz <john.stultz@linaro.org>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
John Stultz
da8b44d5a9 timer: convert timer_slack_ns from unsigned long to u64
This patchset introduces a /proc/<pid>/timerslack_ns interface which
would allow controlling processes to be able to set the timerslack value
on other processes in order to save power by avoiding wakeups (Something
Android currently does via out-of-tree patches).

The first patch tries to fix the internal timer_slack_ns usage which was
defined as a long, which limits the slack range to ~4 seconds on 32bit
systems.  It converts it to a u64, which provides the same basically
unlimited slack (500 years) on both 32bit and 64bit machines.

The second patch introduces the /proc/<pid>/timerslack_ns interface
which allows the full 64bit slack range for a task to be read or set on
both 32bit and 64bit machines.

With these two patches, on a 32bit machine, after setting the slack on
bash to 10 seconds:

$ time sleep 1

real    0m10.747s
user    0m0.001s
sys     0m0.005s

The first patch is a little ugly, since I had to chase the slack delta
arguments through a number of functions converting them to u64s.  Let me
know if it makes sense to break that up more or not.

Other than that things are fairly straightforward.

This patch (of 2):

The timer_slack_ns value in the task struct is currently a unsigned
long.  This means that on 32bit applications, the maximum slack is just
over 4 seconds.  However, on 64bit machines, its much much larger (~500
years).

This disparity could make application development a little (as well as
the default_slack) to a u64.  This means both 32bit and 64bit systems
have the same effective internal slack range.

Now the existing ABI via PR_GET_TIMERSLACK and PR_SET_TIMERSLACK specify
the interface as a unsigned long, so we preserve that limitation on
32bit systems, where SET_TIMERSLACK can only set the slack to a unsigned
long value, and GET_TIMERSLACK will return ULONG_MAX if the slack is
actually larger then what can be stored by an unsigned long.

This patch also modifies hrtimer functions which specified the slack
delta as a unsigned long.

Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Oren Laadan <orenl@cellrox.com>
Cc: Ruchi Kandoi <kandoiruchi@google.com>
Cc: Rom Lemarchand <romlem@android.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Android Kernel Team <kernel-team@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Jason Baron
b6a515c8a0 epoll: restrict EPOLLEXCLUSIVE to POLLIN and POLLOUT
In the current implementation of the EPOLLEXCLUSIVE flag (added for
4.5-rc1), if epoll waiters create different POLL* sets and register them
as exclusive against the same target fd, the current implementation will
stop waking any further waiters once it finds the first idle waiter.
This means that waiters could miss wakeups in certain cases.

For example, when we wake up a pipe for reading we do:
wake_up_interruptible_sync_poll(&pipe->wait, POLLIN | POLLRDNORM); So if
one epoll set or epfd is added to pipe p with POLLIN and a second set
epfd2 is added to pipe p with POLLRDNORM, only epfd may receive the
wakeup since the current implementation will stop after it finds any
intersection of events with a waiter that is blocked in epoll_wait().

We could potentially address this by requiring all epoll waiters that
are added to p be required to pass the same set of POLL* events.  IE the
first EPOLL_CTL_ADD that passes EPOLLEXCLUSIVE establishes the set POLL*
flags to be used by any other epfds that are added as EPOLLEXCLUSIVE.
However, I think it might be somewhat confusing interface as we would
have to reference count the number of users for that set, and so
userspace would have to keep track of that count, or we would need a
more involved interface.  It also adds some shared state that we'd have
store somewhere.  I don't think anybody will want to bloat
__wait_queue_head for this.

I think what we could do instead, is to simply restrict EPOLLEXCLUSIVE
such that it can only be specified with EPOLLIN and/or EPOLLOUT.  So
that way if the wakeup includes 'POLLIN' and not 'POLLOUT', we can stop
once we hit the first idle waiter that specifies the EPOLLIN bit, since
any remaining waiters that only have 'POLLOUT' set wouldn't need to be
woken.  Likewise, we can do the same thing if 'POLLOUT' is in the wakeup
bit set and not 'POLLIN'.  If both 'POLLOUT' and 'POLLIN' are set in the
wake bit set (there is at least one example of this I saw in fs/pipe.c),
then we just wake the entire exclusive list.  Having both 'POLLOUT' and
'POLLIN' both set should not be on any performance critical path, so I
think that's ok (in fs/pipe.c its in pipe_release()).  We also continue
to include EPOLLERR and EPOLLHUP by default in any exclusive set.  Thus,
the user can specify EPOLLERR and/or EPOLLHUP but is not required to do
so.

Since epoll waiters may be interested in other events as well besides
EPOLLIN, EPOLLOUT, EPOLLERR and EPOLLHUP, these can still be added by
doing a 'dup' call on the target fd and adding that as one normally
would with EPOLL_CTL_ADD.  Since I think that the POLLIN and POLLOUT
events are what we are interest in balancing, I think that the 'dup'
thing could perhaps be added to only one of the waiter threads.
However, I think that EPOLLIN, EPOLLOUT, EPOLLERR and EPOLLHUP should be
sufficient for the majority of use-cases.

Since EPOLLEXCLUSIVE is intended to be used with a target fd shared
among multiple epfds, where between 1 and n of the epfds may receive an
event, it does not satisfy the semantics of EPOLLONESHOT where only 1
epfd would get an event.  Thus, it is not allowed to be specified in
conjunction with EPOLLEXCLUSIVE.

EPOLL_CTL_MOD is also not allowed if the fd was previously added as
EPOLLEXCLUSIVE.  It seems with the limited number of flags to not be as
interesting, but this could be relaxed at some further point.

Signed-off-by: Jason Baron <jbaron@akamai.com>
Tested-by: Madars Vitolins <m@silodev.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Eric Wong <normalperson@yhbt.net>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05 18:10:40 -08:00
Jason Baron
df0108c5da epoll: add EPOLLEXCLUSIVE flag
Currently, epoll file descriptors or epfds (the fd returned from
epoll_create[1]()) that are added to a shared wakeup source are always
added in a non-exclusive manner.  This means that when we have multiple
epfds attached to a shared fd source they are all woken up.  This creates
thundering herd type behavior.

Introduce a new 'EPOLLEXCLUSIVE' flag that can be passed as part of the
'event' argument during an epoll_ctl() EPOLL_CTL_ADD operation.  This new
flag allows for exclusive wakeups when there are multiple epfds attached
to a shared fd event source.

The implementation walks the list of exclusive waiters, and queues an
event to each epfd, until it finds the first waiter that has threads
blocked on it via epoll_wait().  The idea is to search for threads which
are idle and ready to process the wakeup events.  Thus, we queue an event
to at least 1 epfd, but may still potentially queue an event to all epfds
that are attached to the shared fd source.

Performance testing was done by Madars Vitolins using a modified version
of Enduro/X.  The use of the 'EPOLLEXCLUSIVE' flag reduce the length of
this particular workload from 860s down to 24s.

Sample epoll_clt text:

EPOLLEXCLUSIVE

  Sets an exclusive wakeup mode for the epfd file descriptor that is
  being attached to the target file descriptor, fd.  Thus, when an event
  occurs and multiple epfd file descriptors are attached to the same
  target file using EPOLLEXCLUSIVE, one or more epfds will receive an
  event with epoll_wait(2).  The default in this scenario (when
  EPOLLEXCLUSIVE is not set) is for all epfds to receive an event.
  EPOLLEXCLUSIVE may only be specified with the op EPOLL_CTL_ADD.

Signed-off-by: Jason Baron <jbaron@akamai.com>
Tested-by: Madars Vitolins <m@silodev.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Al Viro <viro@ftp.linux.org.uk>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Eric Wong <normalperson@yhbt.net>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20 17:09:18 -08:00
Davidlohr Bueso
4d5755b147 epoll: optimize setting task running after blocking
After waking up a task waiting for an event, we explicitly mark it as
TASK_RUNNING (which is necessary as we do the checks for wakeups as
TASK_INTERRUPTIBLE).  Once running and dealing with actually delivering
the events, we're obviously not planning on calling schedule, thus we can
relax the implied barrier and simply update the state with
__set_current_state().

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-13 21:21:40 -08:00
Joe Perches
a3816ab0e8 fs: Convert show_fdinfo functions to void
seq_printf functions shouldn't really check the return value.
Checking seq_has_overflowed() occasionally is used instead.

Update vfs documentation.

Link: http://lkml.kernel.org/p/e37e6e7b76acbdcc3bb4ab2a57c8f8ca1ae11b9a.1412031505.git.joe@perches.com

Cc: David S. Miller <davem@davemloft.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Joe Perches <joe@perches.com>
[ did a few clean ups ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-11-05 14:13:23 -05:00
Nicolas Iooss
c680e41b3a eventpoll: fix uninitialized variable in epoll_ctl
When calling epoll_ctl with operation EPOLL_CTL_DEL, structure epds is
not initialized but ep_take_care_of_epollwakeup reads its event field.
When this unintialized field has EPOLLWAKEUP bit set, a capability check
is done for CAP_BLOCK_SUSPEND in ep_take_care_of_epollwakeup.  This
produces unexpected messages in the audit log, such as (on a system
running SELinux):

    type=AVC msg=audit(1408212798.866:410): avc:  denied
    { block_suspend } for  pid=7754 comm="dbus-daemon" capability=36
    scontext=unconfined_u:unconfined_r:unconfined_t
    tcontext=unconfined_u:unconfined_r:unconfined_t
    tclass=capability2 permissive=1

    type=SYSCALL msg=audit(1408212798.866:410): arch=c000003e syscall=233
    success=yes exit=0 a0=3 a1=2 a2=9 a3=7fffd4d66ec0 items=0 ppid=1
    pid=7754 auid=1000 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0
    fsgid=0 tty=(none) ses=3 comm="dbus-daemon"
    exe="/usr/bin/dbus-daemon"
    subj=unconfined_u:unconfined_r:unconfined_t key=(null)

("arch=c000003e syscall=233 a1=2" means "epoll_ctl(op=EPOLL_CTL_DEL)")

Remove use of epds in epoll_ctl when op == EPOLL_CTL_DEL.

Fixes: 4d7e30d98939 ("epoll: Add a flag, EPOLLWAKEUP, to prevent suspend while epoll events are ready")
Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-10 15:42:12 -07:00
Konstantin Khlebnikov
ebe06187bf epoll: fix use-after-free in eventpoll_release_file
This fixes use-after-free of epi->fllink.next inside list loop macro.
This loop actually releases elements in the body.  The list is
rcu-protected but here we cannot hold rcu_read_lock because we need to
lock mutex inside.

The obvious solution is to use list_for_each_entry_safe().  RCU-ness
isn't essential because nobody can change this list under us, it's final
fput for this file.

The bug was introduced by ae10b2b4eb01 ("epoll: optimize EPOLL_CTL_DEL
using rcu")

Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Reported-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Stable <stable@vger.kernel.org> # 3.13+
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Jason Baron <jbaron@akamai.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-16 17:21:59 -10:00
Joe Perches
1f7e0616cd fs: convert use of typedef ctl_table to struct ctl_table
This typedef is unnecessary and should just be removed.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06 16:08:16 -07:00
Jason Baron
4ff36ee94d epoll: do not take the nested ep->mtx on EPOLL_CTL_DEL
The EPOLL_CTL_DEL path of epoll contains a classic, ab-ba deadlock.
That is, epoll_ctl(a, EPOLL_CTL_DEL, b, x), will deadlock with
epoll_ctl(b, EPOLL_CTL_DEL, a, x).  The deadlock was introduced with
commmit 67347fe4e632 ("epoll: do not take global 'epmutex' for simple
topologies").

The acquistion of the ep->mtx for the destination 'ep' was added such
that a concurrent EPOLL_CTL_ADD operation would see the correct state of
the ep (Specifically, the check for '!list_empty(&f.file->f_ep_links')

However, by simply not acquiring the lock, we do not serialize behind
the ep->mtx from the add path, and thus may perform a full path check
when if we had waited a little longer it may not have been necessary.
However, this is a transient state, and performing the full loop
checking in this case is not harmful.

The important point is that we wouldn't miss doing the full loop
checking when required, since EPOLL_CTL_ADD always locks any 'ep's that
its operating upon.  The reason we don't need to do lock ordering in the
add path, is that we are already are holding the global 'epmutex'
whenever we do the double lock.  Further, the original posting of this
patch, which was tested for the intended performance gains, did not
perform this additional locking.

Signed-off-by: Jason Baron <jbaron@akamai.com>
Cc: Nathan Zimmer <nzimmer@sgi.com>
Cc: Eric Wong <normalperson@yhbt.net>
Cc: Nelson Elhage <nelhage@nelhage.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-02 14:40:30 -08:00
Amit Pundir
95f19f658c epoll: drop EPOLLWAKEUP if PM_SLEEP is disabled
Drop EPOLLWAKEUP from epoll events mask if CONFIG_PM_SLEEP is disabled.

Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-12-03 15:35:52 +01:00
Linus Torvalds
5cbb3d216e Merge branch 'akpm' (patches from Andrew Morton)
Merge first patch-bomb from Andrew Morton:
 "Quite a lot of other stuff is banked up awaiting further
  next->mainline merging, but this batch contains:

   - Lots of random misc patches
   - OCFS2
   - Most of MM
   - backlight updates
   - lib/ updates
   - printk updates
   - checkpatch updates
   - epoll tweaking
   - rtc updates
   - hfs
   - hfsplus
   - documentation
   - procfs
   - update gcov to gcc-4.7 format
   - IPC"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (269 commits)
  ipc, msg: fix message length check for negative values
  ipc/util.c: remove unnecessary work pending test
  devpts: plug the memory leak in kill_sb
  ./Makefile: export initial ramdisk compression config option
  init/Kconfig: add option to disable kernel compression
  drivers: w1: make w1_slave::flags long to avoid memory corruption
  drivers/w1/masters/ds1wm.cuse dev_get_platdata()
  drivers/memstick/core/ms_block.c: fix unreachable state in h_msb_read_page()
  drivers/memstick/core/mspro_block.c: fix attributes array allocation
  drivers/pps/clients/pps-gpio.c: remove redundant of_match_ptr
  kernel/panic.c: reduce 1 byte usage for print tainted buffer
  gcov: reuse kbasename helper
  kernel/gcov/fs.c: use pr_warn()
  kernel/module.c: use pr_foo()
  gcov: compile specific gcov implementation based on gcc version
  gcov: add support for gcc 4.7 gcov format
  gcov: move gcov structs definitions to a gcc version specific file
  kernel/taskstats.c: return -ENOMEM when alloc memory fails in add_del_listener()
  kernel/taskstats.c: add nla_nest_cancel() for failure processing between nla_nest_start() and nla_nest_end()
  kernel/sysctl_binary.c: use scnprintf() instead of snprintf()
  ...
2013-11-13 15:45:43 +09:00
Linus Torvalds
9bc9ccd7db Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "All kinds of stuff this time around; some more notable parts:

   - RCU'd vfsmounts handling
   - new primitives for coredump handling
   - files_lock is gone
   - Bruce's delegations handling series
   - exportfs fixes

  plus misc stuff all over the place"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (101 commits)
  ecryptfs: ->f_op is never NULL
  locks: break delegations on any attribute modification
  locks: break delegations on link
  locks: break delegations on rename
  locks: helper functions for delegation breaking
  locks: break delegations on unlink
  namei: minor vfs_unlink cleanup
  locks: implement delegations
  locks: introduce new FL_DELEG lock flag
  vfs: take i_mutex on renamed file
  vfs: rename I_MUTEX_QUOTA now that it's not used for quotas
  vfs: don't use PARENT/CHILD lock classes for non-directories
  vfs: pull ext4's double-i_mutex-locking into common code
  exportfs: fix quadratic behavior in filehandle lookup
  exportfs: better variable name
  exportfs: move most of reconnect_path to helper function
  exportfs: eliminate unused "noprogress" counter
  exportfs: stop retrying once we race with rename/remove
  exportfs: clear DISCONNECTED on all parents sooner
  exportfs: more detailed comment for path_reconnect
  ...
2013-11-13 15:34:18 +09:00
Jason Baron
67347fe4e6 epoll: do not take global 'epmutex' for simple topologies
When calling EPOLL_CTL_ADD for an epoll file descriptor that is attached
directly to a wakeup source, we do not need to take the global 'epmutex',
unless the epoll file descriptor is nested.  The purpose of taking the
'epmutex' on add is to prevent complex topologies such as loops and deep
wakeup paths from forming in parallel through multiple EPOLL_CTL_ADD
operations.  However, for the simple case of an epoll file descriptor
attached directly to a wakeup source (with no nesting), we do not need to
hold the 'epmutex'.

This patch along with 'epoll: optimize EPOLL_CTL_DEL using rcu' improves
scalability on larger systems.  Quoting Nathan Zimmer's mail on SPECjbb
performance:

"On the 16 socket run the performance went from 35k jOPS to 125k jOPS.  In
addition the benchmark when from scaling well on 10 sockets to scaling
well on just over 40 sockets.

...

Currently the benchmark stops scaling at around 40-44 sockets but it seems like
I found a second unrelated bottleneck."

[akpm@linux-foundation.org: use `bool' for boolean variables, remove unneeded/undesirable cast of void*, add missed ep_scan_ready_list() kerneldoc]
Signed-off-by: Jason Baron <jbaron@akamai.com>
Tested-by: Nathan Zimmer <nzimmer@sgi.com>
Cc: Eric Wong <normalperson@yhbt.net>
Cc: Nelson Elhage <nelhage@nelhage.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:25 +09:00
Jason Baron
ae10b2b4eb epoll: optimize EPOLL_CTL_DEL using rcu
Nathan Zimmer found that once we get over 10+ cpus, the scalability of
SPECjbb falls over due to the contention on the global 'epmutex', which is
taken in on EPOLL_CTL_ADD and EPOLL_CTL_DEL operations.

Patch #1 removes the 'epmutex' lock completely from the EPOLL_CTL_DEL path
by using rcu to guard against any concurrent traversals.

Patch #2 remove the 'epmutex' lock from EPOLL_CTL_ADD operations for
simple topologies.  IE when adding a link from an epoll file descriptor to
a wakeup source, where the epoll file descriptor is not nested.

This patch (of 2):

Optimize EPOLL_CTL_DEL such that it does not require the 'epmutex' by
converting the file->f_ep_links list into an rcu one.  In this way, we can
traverse the epoll network on the add path in parallel with deletes.
Since deletes can't create loops or worse wakeup paths, this is safe.

This patch in combination with the patch "epoll: Do not take global 'epmutex'
for simple topologies", shows a dramatic performance improvement in
scalability for SPECjbb.

Signed-off-by: Jason Baron <jbaron@akamai.com>
Tested-by: Nathan Zimmer <nzimmer@sgi.com>
Cc: Eric Wong <normalperson@yhbt.net>
Cc: Nelson Elhage <nelhage@nelhage.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
CC: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 12:09:25 +09:00
Rafael J. Wysocki
c511851de1 Revert "epoll: use freezable blocking call"
This reverts commit 1c441e921201 (epoll: use freezable blocking call)
which is reported to cause user space memory corruption to happen
after suspend to RAM.

Since it appears to be extremely difficult to root cause this
problem, it is best to revert the offending commit and try to address
the original issue in a better way later.

References: https://bugzilla.kernel.org/show_bug.cgi?id=61781
Reported-by: Natrio <natrio@list.ru>
Reported-by: Jeff Pohlmeyer <yetanothergeek@gmail.com>
Bisected-by: Leo Wolf <jclw@ymail.com>
Fixes: 1c441e921201 (epoll: use freezable blocking call)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: 3.11+ <stable@vger.kernel.org> # 3.11+
2013-10-30 15:27:53 +01:00
Al Viro
72c2d53192 file->f_op is never NULL...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-10-24 23:34:54 -04:00
Eric Dumazet
91cf5ab60f epoll: add a reschedule point in ep_free()
ep_free() might iterate on a huge set of epitems and hold cpu too long.
Add two cond_resched() in order to yield cpu to other tasks.  This is safe
as we only hold mutexes in this function.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Theodore Ts'o <tytso@mit.edu>
Acked-by: Eric Wong <normalperson@yhbt.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:58:50 -07:00
Al Viro
7e3fb5842e switch epoll_ctl() to fdget
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-03 23:04:44 -04:00
Linus Torvalds
7f0ef0267e Merge branch 'akpm' (updates from Andrew Morton)
Merge first patch-bomb from Andrew Morton:
 - various misc bits
 - I'm been patchmonkeying ocfs2 for a while, as Joel and Mark have been
   distracted.  There has been quite a bit of activity.
 - About half the MM queue
 - Some backlight bits
 - Various lib/ updates
 - checkpatch updates
 - zillions more little rtc patches
 - ptrace
 - signals
 - exec
 - procfs
 - rapidio
 - nbd
 - aoe
 - pps
 - memstick
 - tools/testing/selftests updates

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (445 commits)
  tools/testing/selftests: don't assume the x bit is set on scripts
  selftests: add .gitignore for kcmp
  selftests: fix clean target in kcmp Makefile
  selftests: add .gitignore for vm
  selftests: add hugetlbfstest
  self-test: fix make clean
  selftests: exit 1 on failure
  kernel/resource.c: remove the unneeded assignment in function __find_resource
  aio: fix wrong comment in aio_complete()
  drivers/w1/slaves/w1_ds2408.c: add magic sequence to disable P0 test mode
  drivers/memstick/host/r592.c: convert to module_pci_driver
  drivers/memstick/host/jmb38x_ms: convert to module_pci_driver
  pps-gpio: add device-tree binding and support
  drivers/pps/clients/pps-gpio.c: convert to module_platform_driver
  drivers/pps/clients/pps-gpio.c: convert to devm_* helpers
  drivers/parport/share.c: use kzalloc
  Documentation/accounting/getdelays.c: avoid strncpy in accounting tool
  aoe: update internal version number to v83
  aoe: update copyright date
  aoe: perform I/O completions in parallel
  ...
2013-07-03 17:12:13 -07:00
Oleg Nesterov
77d5591802 signals: eventpoll: do not use sigprocmask()
sigprocmask() should die. None of the current callers actually
need this strange interface.

Change fs/eventpoll.c to use set_current_blocked(). This also
means we should not worry about SIGKILL/SIGSTOP.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eric Wong <normalperson@yhbt.net>
Cc: Jason Baron <jbaron@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03 16:08:01 -07:00
Colin Cross
1c441e9212 epoll: use freezable blocking call
Avoid waking up every thread sleeping in an epoll_wait call during
suspend and resume by calling a freezable blocking call.  Previous
patches modified the freezer to avoid sending wakeups to threads
that are blocked in freezable blocking calls.

This call was selected to be converted to a freezable call because
it doesn't hold any locks or release any resources when interrupted
that might be needed by another freezing task or a kernel driver
during suspend, and is a common site where idle userspace tasks are
blocked.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Colin Cross <ccross@android.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-05-12 14:16:22 +02:00
Linus Torvalds
08d7676083 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal
Pull compat cleanup from Al Viro:
 "Mostly about syscall wrappers this time; there will be another pile
  with patches in the same general area from various people, but I'd
  rather push those after both that and vfs.git pile are in."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
  syscalls.h: slightly reduce the jungles of macros
  get rid of union semop in sys_semctl(2) arguments
  make do_mremap() static
  sparc: no need to sign-extend in sync_file_range() wrapper
  ppc compat wrappers for add_key(2) and request_key(2) are pointless
  x86: trim sys_ia32.h
  x86: sys32_kill and sys32_mprotect are pointless
  get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC
  merge compat sys_ipc instances
  consolidate compat lookup_dcookie()
  convert vmsplice to COMPAT_SYSCALL_DEFINE
  switch getrusage() to COMPAT_SYSCALL_DEFINE
  switch epoll_pwait to COMPAT_SYSCALL_DEFINE
  convert sendfile{,64} to COMPAT_SYSCALL_DEFINE
  switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE
  make SYSCALL_DEFINE<n>-generated wrappers do asmlinkage_protect
  make HAVE_SYSCALL_WRAPPERS unconditional
  consolidate cond_syscall and SYSCALL_ALIAS declarations
  teach SYSCALL_DEFINE<n> how to deal with long long/unsigned long long
  get rid of duplicate logics in __SC_....[1-6] definitions
2013-05-01 07:21:43 -07:00
Eric Wong
d6d67e7231 epoll: cleanup: use RCU_INIT_POINTER when nulling
It is always safe to use RCU_INIT_POINTER to NULL a pointer.  This results
in slightly smaller/faster code.

Signed-off-by: Eric Wong <normalperson@yhbt.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30 17:04:04 -07:00