141 Commits

Author SHA1 Message Date
Neal Cardwell
ece09f3d68 FROMGIT: net-tcp_bbr: broaden app-limited rate sample detection
This commit is a bug fix for the Linux TCP app-limited logic
used for collecting rate (bandwidth) samples.

Previously the app-limited logic only looked for "bubbles" of
silence in between application writes, by checking at the start
of each sendmsg. But "bubbles" of silence can also happen before
retransmits: e.g. bubbles can happen between an application write
and a retransmit, or between two retransmits.

Retransmits are triggered by ACKs or timers. So this commit checks
for bubbles of app-limited silence upon ACKs or timers.

Why does this commit check for app-limited state at the start of
ACKs and timer handling? Because at that point we know whether
inflight was fully using the cwnd.  During processing the ACK or
timer event we often change the cwnd; after changing the cwnd we
can't know whether inflight was fully using the old cwnd.

Origin-9xx-SHA1: 3fe9b53291e018407780fb8c356adb5666722cbc
Change-Id: I37221506f5166877c2b110753d39bb0757985e68
(cherry-picked from 80d039a0a0)
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
2021-06-11 23:23:28 +05:30
Srinivasarao P
8241b06f7c Merge android-4.14.159 (f960b38) into msm-4.14
* refs/heads/tmp-f960b38:
  Linux 4.14.159
  of: unittest: fix memory leak in attach_node_and_children
  raid5: need to set STRIPE_HANDLE for batch head
  gpiolib: acpi: Add Terra Pad 1061 to the run_edge_events_on_boot_blacklist
  kernel/module.c: wakeup processes in module_wq on module unload
  gfs2: fix glock reference problem in gfs2_trans_remove_revoke
  net/mlx5e: Fix SFF 8472 eeprom length
  sunrpc: fix crash when cache_head become valid before update
  workqueue: Fix missing kfree(rescuer) in destroy_workqueue()
  blk-mq: make sure that line break can be printed
  mfd: rk808: Fix RK818 ID template
  ext4: fix a bug in ext4_wait_for_tail_page_commit
  mm/shmem.c: cast the type of unmap_start to u64
  firmware: qcom: scm: Ensure 'a0' status code is treated as signed
  ext4: work around deleting a file with i_nlink == 0 safely
  powerpc: Fix vDSO clock_getres()
  powerpc: Avoid clang warnings around setjmp and longjmp
  ath10k: fix fw crash by moving chip reset after napi disabled
  media: vimc: fix component match compare
  mlxsw: spectrum_router: Refresh nexthop neighbour when it becomes dead
  power: supply: cpcap-battery: Fix signed counter sample register
  x86/MCE/AMD: Carve out the MC4_MISC thresholding quirk
  x86/MCE/AMD: Turn off MC4_MISC thresholding on all family 0x15 models
  e100: Fix passing zero to 'PTR_ERR' warning in e100_load_ucode_wait
  drbd: Change drbd_request_detach_interruptible's return type to int
  scsi: lpfc: Correct code setting non existent bits in sli4 ABORT WQE
  scsi: lpfc: Cap NPIV vports to 256
  omap: pdata-quirks: remove openpandora quirks for mmc3 and wl1251
  phy: renesas: rcar-gen3-usb2: Fix sysfs interface of "role"
  iio: adis16480: Add debugfs_reg_access entry
  xhci: make sure interrupts are restored to correct state
  xhci: Fix memory leak in xhci_add_in_port()
  scsi: qla2xxx: Fix message indicating vectors used by driver
  scsi: qla2xxx: Always check the qla2x00_wait_for_hba_online() return value
  scsi: qla2xxx: Fix qla24xx_process_bidir_cmd()
  scsi: qla2xxx: Fix session lookup in qlt_abort_work()
  scsi: qla2xxx: Fix DMA unmap leak
  scsi: zfcp: trace channel log even for FCP command responses
  block: fix single range discard merge
  reiserfs: fix extended attributes on the root directory
  ext4: Fix credit estimate for final inode freeing
  quota: fix livelock in dquot_writeback_dquots
  ext2: check err when partial != NULL
  quota: Check that quota is not dirty before release
  video/hdmi: Fix AVI bar unpack
  powerpc/xive: Skip ioremap() of ESB pages for LSI interrupts
  powerpc: Allow flush_icache_range to work across ranges >4GB
  powerpc/xive: Prevent page fault issues in the machine crash handler
  powerpc: Allow 64bit VDSO __kernel_sync_dicache to work across ranges >4GB
  ppdev: fix PPGETTIME/PPSETTIME ioctls
  ARM: dts: omap3-tao3530: Fix incorrect MMC card detection GPIO polarity
  mmc: host: omap_hsmmc: add code for special init of wl1251 to get rid of pandora_wl1251_init_card
  pinctrl: samsung: Fix device node refcount leaks in S3C64xx wakeup controller init
  pinctrl: samsung: Fix device node refcount leaks in init code
  pinctrl: samsung: Fix device node refcount leaks in S3C24xx wakeup controller init
  pinctrl: samsung: Add of_node_put() before return in error path
  ACPI: PM: Avoid attaching ACPI PM domain to certain devices
  ACPI: bus: Fix NULL pointer check in acpi_bus_get_private_data()
  ACPI: OSL: only free map once in osl.c
  cpufreq: powernv: fix stack bloat and hard limit on number of CPUs
  PM / devfreq: Lock devfreq in trans_stat_show
  intel_th: pci: Add Tiger Lake CPU support
  intel_th: pci: Add Ice Lake CPU support
  intel_th: Fix a double put_device() in error path
  cpuidle: Do not unset the driver if it is there already
  media: cec.h: CEC_OP_REC_FLAG_ values were swapped
  media: radio: wl1273: fix interrupt masking on release
  media: bdisp: fix memleak on release
  s390/mm: properly clear _PAGE_NOEXEC bit when it is not supported
  ar5523: check NULL before memcpy() in ar5523_cmd()
  cgroup: pids: use atomic64_t for pids->limit
  blk-mq: avoid sysfs buffer overflow with too many CPU cores
  ASoC: Jack: Fix NULL pointer dereference in snd_soc_jack_report
  workqueue: Fix pwq ref leak in rescuer_thread()
  workqueue: Fix spurious sanity check failures in destroy_workqueue()
  dm zoned: reduce overhead of backing device checks
  hwrng: omap - Fix RNG wait loop timeout
  watchdog: aspeed: Fix clock behaviour for ast2600
  md/raid0: Fix an error message in raid0_make_request()
  ALSA: hda - Fix pending unsol events at shutdown
  ovl: relax WARN_ON() on rename to self
  lib: raid6: fix awk build warnings
  rtlwifi: rtl8192de: Fix missing enable interrupt flag
  rtlwifi: rtl8192de: Fix missing callback that tests for hw release of buffer
  rtlwifi: rtl8192de: Fix missing code to retrieve RX buffer address
  btrfs: record all roots for rename exchange on a subvol
  Btrfs: send, skip backreference walking for extents with many references
  btrfs: Remove btrfs_bio::flags member
  Btrfs: fix negative subv_writers counter and data space leak after buffered write
  btrfs: use refcount_inc_not_zero in kill_all_nodes
  btrfs: check page->mapping when loading free space cache
  usb: dwc3: ep0: Clear started flag on completion
  virtio-balloon: fix managed page counts when migrating pages between zones
  mtd: spear_smi: Fix Write Burst mode
  tpm: add check after commands attribs tab allocation
  usb: mon: Fix a deadlock in usbmon between mmap and read
  usb: core: urb: fix URB structure initialization function
  USB: adutux: fix interface sanity check
  USB: serial: io_edgeport: fix epic endpoint lookup
  USB: idmouse: fix interface sanity checks
  USB: atm: ueagle-atm: add missing endpoint check
  iio: humidity: hdc100x: fix IIO_HUMIDITYRELATIVE channel reporting
  ARM: dts: pandora-common: define wl1251 as child node of mmc3
  xhci: handle some XHCI_TRUST_TX_LENGTH quirks cases as default behaviour.
  xhci: Increase STS_HALT timeout in xhci_suspend()
  usb: xhci: only set D3hot for pci device
  staging: gigaset: add endpoint-type sanity check
  staging: gigaset: fix illegal free on probe errors
  staging: gigaset: fix general protection fault on probe
  staging: rtl8712: fix interface sanity check
  staging: rtl8188eu: fix interface sanity check
  usb: Allow USB device to be warm reset in suspended state
  USB: documentation: flags on usb-storage versus UAS
  USB: uas: heed CAPACITY_HEURISTICS
  USB: uas: honor flag to avoid CAPACITY16
  media: venus: remove invalid compat_ioctl32 handler
  scsi: qla2xxx: Fix driver unload hang
  usb: gadget: pch_udc: fix use after free
  usb: gadget: configfs: Fix missing spin_lock_init()
  appletalk: Set error code if register_snap_client failed
  appletalk: Fix potential NULL pointer dereference in unregister_snap_client
  KVM: x86: fix out-of-bounds write in KVM_GET_EMULATED_CPUID (CVE-2019-19332)
  ASoC: rsnd: fixup MIX kctrl registration
  binder: Handle start==NULL in binder_update_page_range()
  thermal: Fix deadlock in thermal thermal_zone_device_check
  iomap: Fix pipe page leakage during splicing
  RDMA/qib: Validate ->show()/store() callbacks before calling them
  spi: atmel: Fix CS high support
  crypto: user - fix memory leak in crypto_report
  crypto: ecdh - fix big endian bug in ECC library
  crypto: ccp - fix uninitialized list head
  crypto: af_alg - cast ki_complete ternary op to int
  crypto: crypto4xx - fix double-free in crypto4xx_destroy_sdr
  KVM: x86: fix presentation of TSX feature in ARCH_CAPABILITIES
  KVM: x86: do not modify masked bits of shared MSRs
  KVM: arm/arm64: vgic: Don't rely on the wrong pending table
  drm/i810: Prevent underflow in ioctl
  jbd2: Fix possible overflow in jbd2_log_space_left()
  kernfs: fix ino wrap-around detection
  can: slcan: Fix use-after-free Read in slcan_open
  tty: vt: keyboard: reject invalid keycodes
  CIFS: Fix SMB2 oplock break processing
  CIFS: Fix NULL-pointer dereference in smb2_push_mandatory_locks
  x86/PCI: Avoid AMD FCH XHCI USB PME# from D0 defect
  Input: Fix memory leak in psxpad_spi_probe
  coresight: etm4x: Fix input validation for sysfs.
  Input: goodix - add upside-down quirk for Teclast X89 tablet
  Input: synaptics-rmi4 - don't increment rmiaddr for SMBus transfers
  Input: synaptics-rmi4 - re-enable IRQs in f34v7_do_reflash
  Input: synaptics - switch another X1 Carbon 6 to RMI/SMbus
  ALSA: hda - Add mute led support for HP ProBook 645 G4
  ALSA: pcm: oss: Avoid potential buffer overflows
  ALSA: hda/realtek - Dell headphone has noise on unmute for ALC236
  fuse: verify attributes
  fuse: verify nlink
  sched/fair: Scale bandwidth quota and period without losing quota/period ratio precision
  tcp: exit if nothing to retransmit on RTO timeout
  net: aquantia: fix RSS table and key sizes
  media: vimc: fix start stream when link is disabled
  ARM: dts: sunxi: Fix PMU compatible strings
  usb: mtu3: fix dbginfo in qmu_tx_zlp_error_handler
  mlx4: Use snprintf instead of complicated strcpy
  IB/hfi1: Close VNIC sdma_progress sleep window
  IB/hfi1: Ignore LNI errors before DC8051 transitions to Polling state
  mlxsw: spectrum_router: Relax GRE decap matching check
  firmware: qcom: scm: fix compilation error when disabled
  media: stkwebcam: Bugfix for wrong return values
  tty: Don't block on IO when ldisc change is pending
  nfsd: Return EPERM, not EACCES, in some SETATTR cases
  MIPS: OCTEON: cvmx_pko_mem_debug8: use oldest forward compatible definition
  clk: renesas: r8a77995: Correct parent clock of DU
  powerpc/math-emu: Update macros from GCC
  pstore/ram: Avoid NULL deref in ftrace merging failure path
  net/mlx4_core: Fix return codes of unsupported operations
  dlm: fix invalid cluster name warning
  ARM: dts: realview: Fix some more duplicate regulator nodes
  clk: sunxi-ng: h3/h5: Fix CSI_MCLK parent
  ARM: dts: pxa: clean up USB controller nodes
  mtd: fix mtd_oobavail() incoherent returned value
  kbuild: fix single target build for external module
  modpost: skip ELF local symbols during section mismatch check
  tcp: fix SNMP TCP timeout under-estimation
  tcp: fix SNMP under-estimation on failed retransmission
  tcp: fix off-by-one bug on aborting window-probing socket
  ARM: dts: realview-pbx: Fix duplicate regulator nodes
  ARM: dts: mmp2: fix the gpio interrupt cell number
  net/x25: fix null_x25_address handling
  net/x25: fix called/calling length calculation in x25_parse_address_block
  arm64: dts: meson-gxl-khadas-vim: fix GPIO lines names
  arm64: dts: meson-gxbb-odroidc2: fix GPIO lines names
  arm64: dts: meson-gxbb-nanopi-k2: fix GPIO lines names
  arm64: dts: meson-gxl-libretech-cc: fix GPIO lines names
  ARM: OMAP1/2: fix SoC name printing
  ASoC: au8540: use 64-bit arithmetic instead of 32-bit
  nfsd: fix a warning in __cld_pipe_upcall()
  ARM: debug: enable UART1 for socfpga Cyclone5
  dlm: NULL check before kmem_cache_destroy is not needed
  ARM: dts: sun8i: v3s: Change pinctrl nodes to avoid warning
  ARM: dts: sun5i: a10s: Fix HDMI output DTC warning
  ASoC: rsnd: tidyup registering method for rsnd_kctrl_new()
  lockd: fix decoding of TEST results
  i2c: imx: don't print error message on probe defer
  serial: imx: fix error handling in console_setup
  altera-stapl: check for a null key before strcasecmp'ing it
  dma-mapping: fix return type of dma_set_max_seg_size()
  sparc: Correct ctx->saw_frame_pointer logic.
  f2fs: fix to allow node segment for GC by ioctl path
  ARM: dts: rockchip: Assign the proper GPIO clocks for rv1108
  ARM: dts: rockchip: Fix the PMU interrupt number for rv1108
  f2fs: change segment to section in f2fs_ioc_gc_range
  f2fs: fix count of seg_freed to make sec_freed correct
  ACPI: fix acpi_find_child_device() invocation in acpi_preset_companion()
  usb: dwc3: don't log probe deferrals; but do log other error codes
  usb: dwc3: debugfs: Properly print/set link state for HS
  dmaengine: dw-dmac: implement dma protection control setting
  dmaengine: coh901318: Remove unused variable
  dmaengine: coh901318: Fix a double-lock bug
  media: cec: report Vendor ID after initialization
  media: pulse8-cec: return 0 when invalidating the logical address
  ARM: dts: exynos: Use Samsung SoC specific compatible for DWC2 module
  rtc: dt-binding: abx80x: fix resistance scale
  rtc: max8997: Fix the returned value in case of error in 'max8997_rtc_read_alarm()'
  math-emu/soft-fp.h: (_FP_ROUND_ZERO) cast 0 to void to fix warning
  net/smc: use after free fix in smc_wr_tx_put_slot()
  MIPS: OCTEON: octeon-platform: fix typing
  iomap: sub-block dio needs to zeroout beyond EOF
  net-next/hinic:fix a bug in set mac address
  regulator: Fix return value of _set_load() stub
  clk: rockchip: fix ID of 8ch clock of I2S1 for rk3328
  clk: rockchip: fix I2S1 clock gate register for rk3328
  mm/vmstat.c: fix NUMA statistics updates
  Staging: iio: adt7316: Fix i2c data reading, set the data field
  pinctrl: qcom: ssbi-gpio: fix gpio-hog related boot issues
  crypto: bcm - fix normal/non key hash algorithm failure
  crypto: ecc - check for invalid values in the key verification test
  scsi: zfcp: drop default switch case which might paper over missing case
  net: dsa: mv88e6xxx: Work around mv886e6161 SERDES missing MII_PHYSID2
  MIPS: SiByte: Enable ZONE_DMA32 for LittleSur
  dlm: fix missing idr_destroy for recover_idr
  ARM: dts: rockchip: Fix rk3288-rock2 vcc_flash name
  clk: rockchip: fix rk3188 sclk_mac_lbtest parameter ordering
  clk: rockchip: fix rk3188 sclk_smc gate data
  i40e: don't restart nway if autoneg not supported
  rtc: s3c-rtc: Avoid using broken ALMYEAR register
  net: ethernet: ti: cpts: correct debug for expired txq skb
  extcon: max8997: Fix lack of path setting in USB device mode
  dlm: fix possible call to kfree() for non-initialized pointer
  clk: sunxi-ng: a64: Fix gate bit of DSI DPHY
  net/mlx5: Release resource on error flow
  ARM: 8813/1: Make aligned 2-byte getuser()/putuser() atomic on ARMv6+
  iwlwifi: mvm: Send non offchannel traffic via AP sta
  iwlwifi: mvm: synchronize TID queue removal
  cxgb4vf: fix memleak in mac_hlist initialization
  serial: core: Allow processing sysrq at port unlock time
  i2c: core: fix use after free in of_i2c_notify
  net: ep93xx_eth: fix mismatch of request_mem_region in remove
  rsxx: add missed destroy_workqueue calls in remove
  ALSA: pcm: Fix stream lock usage in snd_pcm_period_elapsed()
  sched/core: Avoid spurious lock dependencies
  Input: cyttsp4_core - fix use after free bug
  xfrm: release device reference for invalid state
  NFC: nxp-nci: Fix NULL pointer dereference after I2C communication error
  audit_get_nd(): don't unlock parent too early
  exportfs_decode_fh(): negative pinned may become positive without the parent locked
  iwlwifi: pcie: don't consider IV len in A-MSDU
  RDMA/hns: Correct the value of HNS_ROCE_HEM_CHUNK_LEN
  autofs: fix a leak in autofs_expire_indirect()
  serial: ifx6x60: add missed pm_runtime_disable
  serial: serial_core: Perform NULL checks for break_ctl ops
  serial: pl011: Fix DMA ->flush_buffer()
  tty: serial: msm_serial: Fix flow control
  tty: serial: fsl_lpuart: use the sg count from dma_map_sg
  usb: gadget: u_serial: add missing port entry locking
  arm64: tegra: Fix 'active-low' warning for Jetson TX1 regulator
  rsi: release skb if rsi_prepare_beacon fails
  ANDROID: staging: android: ion: Fix build when CONFIG_ION_SYSTEM_HEAP=n
  ANDROID: staging: android: ion: Expose total heap and pool sizes via sysfs
  UPSTREAM: include/linux/slab.h: fix sparse warning in kmalloc_type()
  UPSTREAM: mm, slab: shorten kmalloc cache names for large sizes
  UPSTREAM: mm, proc: add KReclaimable to /proc/meminfo
  BACKPORT: mm: rename and change semantics of nr_indirectly_reclaimable_bytes
  UPSTREAM: dcache: allocate external names from reclaimable kmalloc caches
  BACKPORT: mm, slab/slub: introduce kmalloc-reclaimable caches
  UPSTREAM: mm, slab: combine kmalloc_caches and kmalloc_dma_caches
  ANDROID: kbuild: disable SCS by default in allmodconfig
  ANDROID: arm64: cuttlefish_defconfig: enable LTO, CFI, and SCS
  BACKPORT: FROMLIST: arm64: implement Shadow Call Stack
  FROMLIST: arm64: disable SCS for hypervisor code
  BACKPORT: FROMLIST: arm64: vdso: disable Shadow Call Stack
  FROMLIST: arm64: preserve x18 when CPU is suspended
  FROMLIST: arm64: reserve x18 from general allocation with SCS
  FROMLIST: arm64: disable function graph tracing with SCS
  FROMLIST: scs: add support for stack usage debugging
  FROMLIST: scs: add accounting
  FROMLIST: add support for Clang's Shadow Call Stack (SCS)
  FROMLIST: arm64: kernel: avoid x18 in __cpu_soft_restart
  FROMLIST: arm64: kvm: stop treating register x18 as caller save
  FROMLIST: arm64/lib: copy_page: avoid x18 register in assembler code
  FROMLIST: arm64: mm: avoid x18 in idmap_kpti_install_ng_mappings
  ANDROID: use non-canonical CFI jump tables
  ANDROID: arm64: add __nocfi to __apply_alternatives
  ANDROID: arm64: add __pa_function
  ANDROID: arm64: allow ThinLTO to be selected
  ANDROID: soc/tegra: disable ARCH_TEGRA_210_SOC with LTO
  FROMLIST: arm64: fix alternatives with LLVM's integrated assembler
  ANDROID: irqchip/gic-v3: rename gic_of_init to work around a ThinLTO+CFI bug
  ANDROID: kbuild: limit LTO inlining
  ANDROID: kbuild: merge module sections with LTO
  ANDROID: init: ensure initcall ordering with LTO
  Revert "ANDROID: HACK: init: ensure initcall ordering with LTO"
  ANDROID: add support for ThinLTO
  ANDROID: Switch to LLD
  ANDROID: clang: update to 10.0.1
  ANDROID: arm64: add atomic_ll_sc.o to obj-y if using lld
  ANDROID: enable ARM64_ERRATUM_843419 by default with LTO_CLANG
  ANDROID: kbuild: allow lld to be used with CONFIG_LTO_CLANG
  ANDROID: Makefile: set -Qunused-arguments sooner
  BACKPORT: FROMLIST: Makefile: lld: tell clang to use lld
  BACKPORT: FROMLIST: Makefile: lld: set -O2 linker flag when linking with LLD
  ANDROID: scripts/Kbuild: add ld-name support for ld.lld
  UPSTREAM: bpf: permit multiple bpf attachments for a single perf event
  UPSTREAM: bpf: use the same condition in perf event set/free bpf handler
  UPSTREAM: bpf: multi program support for cgroup+bpf
  BACKPORT: serdev: make synchronous write return bytes written
  UPSTREAM: gnss: serial: fix synchronous write timeout
  UPSTREAM: gnss: fix potential error pointer dereference
  BACKPORT: gnss: add receiver type support
  UPSTREAM: dt-bindings: add generic gnss binding
  UPSTREAM: gnss: add generic serial driver
  ANDROID: cuttlefish_defconfig: Enable CONFIG_SERIAL_DEV_BUS
  ANDROID: cuttlefish_defconfig: Enable CONFIG_GNSS
  BACKPORT: gnss: add GNSS receiver subsystem
  UPSTREAM: arm64: Validate tagged addresses in access_ok() called from kernel threads
  BACKPORT: ARM: 8905/1: Emit __gnu_mcount_nc when using Clang 10.0.0 or newer
  fs/lock: skip lock owner pid translation in case we are in init_pid_ns
  f2fs: stop GC when the victim becomes fully valid
  f2fs: expose main_blkaddr in sysfs
  f2fs: choose hardlimit when softlimit is larger than hardlimit in f2fs_statfs_project()
  f2fs: Fix deadlock in f2fs_gc() context during atomic files handling
  f2fs: show f2fs instance in printk_ratelimited
  f2fs: fix potential overflow
  f2fs: fix to update dir's i_pino during cross_rename
  f2fs: support aligned pinned file
  f2fs: avoid kernel panic on corruption test
  f2fs: fix wrong description in document
  f2fs: cache global IPU bio
  f2fs: fix to avoid memory leakage in f2fs_listxattr
  f2fs: check total_segments from devices in raw_super
  f2fs: update multi-dev metadata in resize_fs
  f2fs: mark recovery flag correctly in read_raw_super_block()
  f2fs: fix to update time in lazytime mode
  vfs: don't allow writes to swap files
  mm: set S_SWAPFILE on blockdev swap devices

Conflicts:
	drivers/Makefile
	drivers/staging/android/ion/ion.c
	drivers/staging/android/ion/ion.h
	drivers/staging/android/ion/ion_page_pool.c
	drivers/usb/dwc3/core.c
	drivers/usb/dwc3/debugfs.c
	drivers/usb/dwc3/ep0.c
	fs/f2fs/data.c
	include/linux/mmzone.h
	mm/vmstat.c

Discarded below patches, as usb patches not applicable and block patch
causing stability issues:
	usb: dwc3: ep0: Clear started flag on completion
	usb: dwc3: don't log probe deferrals; but do log other error codes
	block: fix single range discard merge

Fixed build errors in below files:
	drivers/gpu/msm/kgsl_pool.c
	drivers/staging/android/ion/ion_page_pool.c
	kernel/taskstats.c

Fixed bootup issue in:
	arch/arm64/mm/proc.s

Change-Id: I0a16824c251c14c63af78f9cfd9ede5e82c427fc
Signed-off-by: Srinivasarao P <spathi@codeaurora.org>
Signed-off-by: Blagovest Kolenichev <bkolenichev@codeaurora.org>
2020-04-17 17:47:52 +05:30
Eric Dumazet
6d9175b955 tcp: exit if nothing to retransmit on RTO timeout
Two upstream commits squashed together for v4.14 stable :

 commit 88f8598d0a302a08380eadefd09b9f5cb1c4c428 upstream.

  Previously TCP only warns if its RTO timer fires and the
  retransmission queue is empty, but it'll cause null pointer
  reference later on. It's better to avoid such catastrophic failure
  and simply exit with a warning.

Squashed with "tcp: refactor tcp_retransmit_timer()" :

 commit 0d580fbd2db084a5c96ee9c00492236a279d5e0f upstream.

  It appears linux-4.14 stable needs a backport of commit
  88f8598d0a30 ("tcp: exit if nothing to retransmit on RTO timeout")

  Since tcp_rtx_queue_empty() is not in pre 4.15 kernels,
  let's refactor tcp_retransmit_timer() to only use tcp_rtx_queue_head()

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-17 20:38:43 +01:00
Yuchung Cheng
2e117bb2ab tcp: fix SNMP TCP timeout under-estimation
[ Upstream commit e1561fe2dd69dc5dddd69bd73aa65355bdfb048b ]

Previously the SNMP TCPTIMEOUTS counter has inconsistent accounting:
1. It counts all SYN and SYN-ACK timeouts
2. It counts timeouts in other states except recurring timeouts and
   timeouts after fast recovery or disorder state.

Such selective accounting makes analysis difficult and complicated. For
example the monitoring system needs to collect many other SNMP counters
to infer the total amount of timeout events. This patch makes TCPTIMEOUTS
counter simply counts all the retransmit timeout (SYN or data or FIN).

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-17 20:38:29 +01:00
Yuchung Cheng
5b7ff64f50 tcp: fix off-by-one bug on aborting window-probing socket
[ Upstream commit 3976535af0cb9fe34a55f2ffb8d7e6b39a2f8188 ]

Previously there is an off-by-one bug on determining when to abort
a stalled window-probing socket. This patch fixes that so it is
consistent with tcp_write_timeout().

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-17 20:38:28 +01:00
Blagovest Kolenichev
045b2f2096 Merge android-4.14.127 (7fbc8c5) into msm-4.14
* refs/heads/tmp-7fbc8c5:
  Linux 4.14.127
  tcp: enforce tcp_min_snd_mss in tcp_mtu_probing()
  tcp: add tcp_min_snd_mss sysctl
  tcp: tcp_fragment() should apply sane memory limits
  tcp: limit payload size of sacked skbs
  tcp: reduce tcp_fastretrans_alert() verbosity

Change-Id: Iab96ac60178018390d1442bb6be7a01061f5f452
Signed-off-by: Blagovest Kolenichev <bkolenichev@codeaurora.org>
2019-07-23 11:00:36 -07:00
Eric Dumazet
f2aa4f1a05 tcp: enforce tcp_min_snd_mss in tcp_mtu_probing()
commit 967c05aee439e6e5d7d805e195b3a20ef5c433d6 upstream.

If mtu probing is enabled tcp_mtu_probing() could very well end up
with a too small MSS.

Use the new sysctl tcp_min_snd_mss to make sure MSS search
is performed in an acceptable range.

CVE-2019-11479 -- tcp mss hardcoded to 48

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: Jonathan Looney <jtl@netflix.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Tyler Hicks <tyhicks@canonical.com>
Cc: Bruce Curtis <brucec@netflix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-17 19:52:45 +02:00
Ravi Joshi
86b334d715 WLAN subsystem: Sysctl support for key TCP/IP parameters
It has been observed that default values for some of key tcp/ip
parameters are affecting the tput/performance of the system. Hence
extending configuration capabilities to TCP/Ip stack through
sysctl interface

Change-Id: I4287e9103769535f43e0934bac08435a524ee6a4
CRs-Fixed: 507581
Signed-off-by: Ravi Joshi <ravij@codeaurora.org>
Signed-off-by: Ganesh Babu Kumaravel <kganesh@codeaurora.org>
Signed-off-by: Mohit Khanna <mkhannaqca@codeaurora.org>
Signed-off-by: Manjunathappa Prakash <prakashpm@codeaurora.org>
2018-10-10 11:45:25 -07:00
Soheil Hassas Yeganeh
e44c173305 tcp: purge write queue upon aborting the connection
[ Upstream commit e05836ac07c77dd90377f8c8140bce2a44af5fe7 ]

When the connection is aborted, there is no point in
keeping the packets on the write queue until the connection
is closed.

Similar to a27fd7a8ed38 ('tcp: purge write queue upon RST'),
this is essential for a correct MSG_ZEROCOPY implementation,
because userspace cannot call close(fd) before receiving
zerocopy signals even when the connection is aborted.

Fixes: f214f915e7db ("tcp: enable MSG_ZEROCOPY")
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-31 18:10:38 +02:00
Dan Streetman
32e57f8c55 net: tcp: close sock if net namespace is exiting
[ Upstream commit 4ee806d51176ba7b8ff1efd81f271d7252e03a1d ]

When a tcp socket is closed, if it detects that its net namespace is
exiting, close immediately and do not wait for FIN sequence.

For normal sockets, a reference is taken to their net namespace, so it will
never exit while the socket is open.  However, kernel sockets do not take a
reference to their net namespace, so it may begin exiting while the kernel
socket is still open.  In this case if the kernel socket is a tcp socket,
it will stay open trying to complete its close sequence.  The sock's dst(s)
hold a reference to their interface, which are all transferred to the
namespace's loopback interface when the real interfaces are taken down.
When the namespace tries to take down its loopback interface, it hangs
waiting for all references to the loopback interface to release, which
results in messages like:

unregister_netdevice: waiting for lo to become free. Usage count = 1

These messages continue until the socket finally times out and closes.
Since the net namespace cleanup holds the net_mutex while calling its
registered pernet callbacks, any new net namespace initialization is
blocked until the current net namespace finishes exiting.

After this change, the tcp socket notices the exiting net namespace, and
closes immediately, releasing its dst(s) and their reference to the
loopback interface, which lets the net namespace continue exiting.

Link: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=97811
Signed-off-by: Dan Streetman <ddstreet@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-31 14:03:45 +01:00
Eric Dumazet
5504319c69 tcp: refresh tcp_mstamp from timers callbacks
[ Upstream commit 4688eb7cf3ae2c2721d1dacff5c1384cba47d176 ]

Only the retransmit timer currently refreshes tcp_mstamp

We should do the same for delayed acks and keepalives.

Even if RFC 7323 does not request it, this is consistent to what linux
did in the past, when TS values were based on jiffies.

Fixes: 385e20706fac ("tcp: use tp->tcp_mstamp in output path")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Mike Maloney <maloney@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by:  Mike Maloney <maloney@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-02 20:31:11 +01:00
David S. Miller
3118e6e19d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The UDP offload conflict is dealt with by simply taking what is
in net-next where we have removed all of the UFO handling code
entirely.

The TCP conflict was a case of local variables in a function
being removed from both net and net-next.

In netvsc we had an assignment right next to where a missing
set of u64 stats sync object inits were added.

Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-09 16:28:45 -07:00
Eric Dumazet
2dda640040 net: fix keepalive code vs TCP_FASTOPEN_CONNECT
syzkaller was able to trigger a divide by 0 in TCP stack [1]

Issue here is that keepalive timer needs to be updated to not attempt
to send a probe if the connection setup was deferred using
TCP_FASTOPEN_CONNECT socket option added in linux-4.11

[1]
 divide error: 0000 [#1] SMP
 CPU: 18 PID: 0 Comm: swapper/18 Not tainted
 task: ffff986f62f4b040 ti: ffff986f62fa2000 task.ti: ffff986f62fa2000
 RIP: 0010:[<ffffffff8409cc0d>]  [<ffffffff8409cc0d>] __tcp_select_window+0x8d/0x160
 Call Trace:
  <IRQ>
  [<ffffffff8409d951>] tcp_transmit_skb+0x11/0x20
  [<ffffffff8409da21>] tcp_xmit_probe_skb+0xc1/0xe0
  [<ffffffff840a0ee8>] tcp_write_wakeup+0x68/0x160
  [<ffffffff840a151b>] tcp_keepalive_timer+0x17b/0x230
  [<ffffffff83b3f799>] call_timer_fn+0x39/0xf0
  [<ffffffff83b40797>] run_timer_softirq+0x1d7/0x280
  [<ffffffff83a04ddb>] __do_softirq+0xcb/0x257
  [<ffffffff83ae03ac>] irq_exit+0x9c/0xb0
  [<ffffffff83a04c1a>] smp_apic_timer_interrupt+0x6a/0x80
  [<ffffffff83a03eaf>] apic_timer_interrupt+0x7f/0x90
  <EOI>
  [<ffffffff83fed2ea>] ? cpuidle_enter_state+0x13a/0x3b0
  [<ffffffff83fed2cd>] ? cpuidle_enter_state+0x11d/0x3b0

Tested:

Following packetdrill no longer crashes the kernel

`echo 0 >/proc/sys/net/ipv4/tcp_timestamps`

// Cache warmup: send a Fast Open cookie request
    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
   +0 setsockopt(3, SOL_TCP, TCP_FASTOPEN_CONNECT, [1], 4) = 0
   +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation is now in progress)
   +0 > S 0:0(0) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO,nop,nop>
 +.01 < S. 123:123(0) ack 1 win 14600 <mss 1460,nop,nop,sackOK,nop,wscale 6,FO abcd1234,nop,nop>
   +0 > . 1:1(0) ack 1
   +0 close(3) = 0
   +0 > F. 1:1(0) ack 1
   +0 < F. 1:1(0) ack 2 win 92
   +0 > .  2:2(0) ack 2

   +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
   +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
   +0 setsockopt(4, SOL_TCP, TCP_FASTOPEN_CONNECT, [1], 4) = 0
   +0 setsockopt(4, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
 +.01 connect(4, ..., ...) = 0
   +0 setsockopt(4, SOL_TCP, TCP_KEEPIDLE, [5], 4) = 0
   +10 close(4) = 0

`echo 1 >/proc/sys/net/ipv4/tcp_timestamps`

Fixes: 19f6d3f3c842 ("net/tcp-fastopen: Add new API support")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Wei Wang <weiwan@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-03 09:34:51 -07:00
Florian Westphal
e7942d0633 tcp: remove prequeue support
prequeue is a tcp receive optimization that moves part of rx processing
from bh to process context.

This only works if the socket being processed belongs to a process that
is blocked in recv on that socket.

In practice, this doesn't happen anymore that often because nowadays
servers tend to use an event driven (epoll) model.

Even normal client applications (web browsers) commonly use many tcp
connections in parallel.

This has measureable impact only in netperf (which uses plain recv and
thus allows prequeue use) from host to locally running vm (~4%), however,
there were no changes when using netperf between two physical hosts with
ixgbe interfaces.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-31 14:37:49 -07:00
Eric Dumazet
ce682ef6e3 tcp: fix TCP_SYNCNT flakes
After the mentioned commit, some of our packetdrill tests became flaky.

TCP_SYNCNT socket option can limit the number of SYN retransmits.

retransmits_timed_out() has to compare times computations based on
local_clock() while timers are based on jiffies. With NTP adjustments
and roundings we can observe 999 ms delay for 1000 ms timers.
We end up sending one extra SYN packet.

Gimmick added in commit 6fa12c850314 ("Revert Backoff [v3]: Calculate
TCP's connection close threshold as a time value") makes no
real sense for TCP_SYN_SENT sockets where no RTO backoff can happen at
all.

Lets use a simpler logic for TCP_SYN_SENT sockets and remove @syn_set
parameter from retransmits_timed_out()

Fixes: 9a568de4818d ("tcp: switch TCP TS option (RFC 7323) to 1ms clock")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-24 16:29:57 -04:00
Eric Dumazet
4ab688793e tcp: fix tcp_probe_timer() for TCP_USER_TIMEOUT
TCP_USER_TIMEOUT is still converted to jiffies value in
icsk_user_timeout

So we need to make a conversion for the cases HZ != 1000

Fixes: 9a568de4818d ("tcp: switch TCP TS option (RFC 7323) to 1ms clock")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-21 13:50:34 -04:00
Eric Dumazet
9a568de481 tcp: switch TCP TS option (RFC 7323) to 1ms clock
TCP Timestamps option is defined in RFC 7323

Traditionally on linux, it has been tied to the internal
'jiffies' variable, because it had been a cheap and good enough
generator.

For TCP flows on the Internet, 1 ms resolution would be much better
than 4ms or 10ms (HZ=250 or HZ=100 respectively)

For TCP flows in the DC, Google has used usec resolution for more
than two years with great success [1]

Receive size autotuning (DRS) is indeed more precise and converges
faster to optimal window size.

This patch converts tp->tcp_mstamp to a plain u64 value storing
a 1 usec TCP clock.

This choice will allow us to upstream the 1 usec TS option as
discussed in IETF 97.

[1] https://www.ietf.org/proceedings/97/slides/slides-97-tcpm-tcp-options-for-low-latency-00.pdf

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 16:06:01 -04:00
Eric Dumazet
c74df29a8d tcp: use tcp_jiffies32 to feed probe_timestamp
Use tcp_jiffies32 instead of tcp_time_stamp, since
tcp_time_stamp will soon be only used for TCP TS option.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 16:06:01 -04:00
Eric Dumazet
70eabf0e1b tcp: use tcp_jiffies32 for rcv_tstamp and lrcvtime
Use tcp_jiffies32 instead of tcp_time_stamp, since
tcp_time_stamp will soon be only used for TCP TS option.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 16:06:01 -04:00
Eric Dumazet
d635fbe27e tcp: use tcp_jiffies32 to feed tp->lsndtime
Use tcp_jiffies32 instead of tcp_time_stamp to feed
tp->lsndtime.

tcp_time_stamp will soon be a litle bit more expensive
than simply reading 'jiffies'.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 16:06:01 -04:00
Eric Dumazet
385e20706f tcp: use tp->tcp_mstamp in output path
Idea is to later convert tp->tcp_mstamp to a full u64 counter
using usec resolution, so that we can later have fine
grained TCP TS clock (RFC 7323), regardless of HZ value.

We try to refresh tp->tcp_mstamp only when necessary.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-17 16:06:01 -04:00
Eric Dumazet
218af599fa tcp: internal implementation for pacing
BBR congestion control depends on pacing, and pacing is
currently handled by sch_fq packet scheduler for performance reasons,
and also because implemening pacing with FQ was convenient to truly
avoid bursts.

However there are many cases where this packet scheduler constraint
is not practical.
- Many linux hosts are not focusing on handling thousands of TCP
  flows in the most efficient way.
- Some routers use fq_codel or other AQM, but still would like
  to use BBR for the few TCP flows they initiate/terminate.

This patch implements an automatic fallback to internal pacing.

Pacing is requested either by BBR or use of SO_MAX_PACING_RATE option.

If sch_fq happens to be in the egress path, pacing is delegated to
the qdisc, otherwise pacing is done by TCP itself.

One advantage of pacing from TCP stack is to get more precise rtt
estimations, and less work done from TX completion, since TCP Small
queue limits are not generally hit. Setups with single TX queue but
many cpus might even benefit from this.

Note that unlike sch_fq, we do not take into account header sizes.
Taking care of these headers would add additional complexity for
no practical differences in behavior.

Some performance numbers using 800 TCP_STREAM flows rate limited to
~48 Mbit per second on 40Gbit NIC.

If MQ+pfifo_fast is used on the NIC :

$ sar -n DEV 1 5 | grep eth
14:48:44         eth0 725743.00 2932134.00  46776.76 4335184.68      0.00      0.00      1.00
14:48:45         eth0 725349.00 2932112.00  46751.86 4335158.90      0.00      0.00      0.00
14:48:46         eth0 725101.00 2931153.00  46735.07 4333748.63      0.00      0.00      0.00
14:48:47         eth0 725099.00 2931161.00  46735.11 4333760.44      0.00      0.00      1.00
14:48:48         eth0 725160.00 2931731.00  46738.88 4334606.07      0.00      0.00      0.00
Average:         eth0 725290.40 2931658.20  46747.54 4334491.74      0.00      0.00      0.40
$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 4  0      0 259825920  45644 2708324    0    0    21     2  247   98  0  0 100  0  0
 4  0      0 259823744  45644 2708356    0    0     0     0 2400825 159843  0 19 81  0  0
 0  0      0 259824208  45644 2708072    0    0     0     0 2407351 159929  0 19 81  0  0
 1  0      0 259824592  45644 2708128    0    0     0     0 2405183 160386  0 19 80  0  0
 1  0      0 259824272  45644 2707868    0    0     0    32 2396361 158037  0 19 81  0  0

Now use MQ+FQ :

lpaa23:~# echo fq >/proc/sys/net/core/default_qdisc
lpaa23:~# tc qdisc replace dev eth0 root mq

$ sar -n DEV 1 5 | grep eth
14:49:57         eth0 678614.00 2727930.00  43739.13 4033279.14      0.00      0.00      0.00
14:49:58         eth0 677620.00 2723971.00  43674.69 4027429.62      0.00      0.00      1.00
14:49:59         eth0 676396.00 2719050.00  43596.83 4020125.02      0.00      0.00      0.00
14:50:00         eth0 675197.00 2714173.00  43518.62 4012938.90      0.00      0.00      1.00
14:50:01         eth0 676388.00 2719063.00  43595.47 4020171.64      0.00      0.00      0.00
Average:         eth0 676843.00 2720837.40  43624.95 4022788.86      0.00      0.00      0.40
$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 259832240  46008 2710912    0    0    21     2  223  192  0  1 99  0  0
 1  0      0 259832896  46008 2710744    0    0     0     0 1702206 198078  0 17 82  0  0
 0  0      0 259830272  46008 2710596    0    0     0     0 1696340 197756  1 17 83  0  0
 4  0      0 259829168  46024 2710584    0    0    16     0 1688472 197158  1 17 82  0  0
 3  0      0 259830224  46024 2710408    0    0     0     0 1692450 197212  0 18 82  0  0

As expected, number of interrupts per second is very different.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Van Jacobson <vanj@google.com>
Cc: Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-16 15:43:31 -04:00
Wei Wang
59450f8d83 net/tcp_fastopen: Remove mss check in tcp_write_timeout()
Christoph Paasch from Apple found another firewall issue for TFO:
After successful 3WHS using TFO, server and client starts to exchange
data. Afterwards, a 10s idle time occurs on this connection. After that,
firewall starts to drop every packet on this connection.

The fix for this issue is to extend existing firewall blackhole detection
logic in tcp_write_timeout() by removing the mss check.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-24 14:27:17 -04:00
Eric Dumazet
02b2faaf0a tcp: fix various issues for sockets morphing to listen state
Dmitry Vyukov reported a divide by 0 triggered by syzkaller, exploiting
tcp_disconnect() path that was never really considered and/or used
before syzkaller ;)

I was not able to reproduce the bug, but it seems issues here are the
three possible actions that assumed they would never trigger on a
listener.

1) tcp_write_timer_handler
2) tcp_delack_timer_handler
3) MTU reduction

Only IPv6 MTU reduction was properly testing TCP_CLOSE and TCP_LISTEN
 states from tcp_v6_mtu_reduced()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-03-07 13:58:33 -08:00
Yuchung Cheng
bec41a11dd tcp: remove early retransmit
This patch removes the support of RFC5827 early retransmit (i.e.,
fast recovery on small inflight with <3 dupacks) because it is
subsumed by the new RACK loss detection. More specifically when
RACK receives DUPACKs, it'll arm a reordering timer to start fast
recovery after a quarter of (min)RTT, hence it covers the early
retransmit except RACK does not limit itself to specific inflight
or dupack numbers.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 22:37:16 -05:00
Yuchung Cheng
57dde7f70d tcp: add reordering timer in RACK loss detection
This patch makes RACK install a reordering timer when it suspects
some packets might be lost, but wants to delay the decision
a little bit to accomodate reordering.

It does not create a new timer but instead repurposes the existing
RTO timer, because both are meant to retransmit packets.
Specifically it arms a timer ICSK_TIME_REO_TIMEOUT when
the RACK timing check fails. The wait time is set to

  RACK.RTT + RACK.reo_wnd - (NOW - Packet.xmit_time) + fudge

This translates to expecting a packet (Packet) should take
(RACK.RTT + RACK.reo_wnd + fudge) to deliver after it was sent.

When there are multiple packets that need a timer, we use one timer
with the maximum timeout. Therefore the timer conservatively uses
the maximum window to expire N packets by one timeout, instead of
N timeouts to expire N packets sent at different times.

The fudge factor is 2 jiffies to ensure when the timer fires, all
the suspected packets would exceed the deadline and be marked lost
by tcp_rack_detect_loss(). It has to be at least 1 jiffy because the
clock may tick between calling icsk_reset_xmit_timer(timeout) and
actually hang the timer. The next jiffy is to lower-bound the timeout
to 2 jiffies when reo_wnd is < 1ms.

When the reordering timer fires (tcp_rack_reo_timeout): If we aren't
in Recovery we'll enter fast recovery and force fast retransmit.
This is very similar to the early retransmit (RFC5827) except RACK
is not constrained to only enter recovery for small outstanding
flights.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-13 22:37:16 -05:00
Ursula Braun
4b9d07a440 net: introduce keepalive function in struct proto
Direct call of tcp_set_keepalive() function from protocol-agnostic
sock_setsockopt() function in net/core/sock.c violates network
layering. And newly introduced protocol (SMC-R) will need its own
keepalive function. Therefore, add "keepalive" function pointer
to "struct proto", and call it from sock_setsockopt() via this pointer.

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Reviewed-by: Utz Bacher <utz.bacher@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2017-01-09 16:07:37 -05:00
Eric Dumazet
7aa5470c2c tcp: tsq: move tsq_flags close to sk_wmem_alloc
tsq_flags being in the same cache line than sk_wmem_alloc
makes a lot of sense. Both fields are changed from tcp_wfree()
and more generally by various TSQ related functions.

Prior patch made room in struct sock and added sk_tsq_flags,
this patch deletes tsq_flags from struct tcp_sock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-05 13:32:24 -05:00
Lawrence Brakmo
3acf3ec3f4 tcp: Change txhash on every SYN and RTO retransmit
The current code changes txhash (flowlables) on every retransmitted
SYN/ACK, but only after the 2nd retransmitted SYN and only after
tcp_retries1 RTO retransmits.

With this patch:
1) txhash is changed with every SYN retransmits
2) txhash is changed with every RTO.

The result is that we can start re-routing around failed (or very
congested paths) as soon as possible. Otherwise application health
checks may fail and the connection may be terminated before we start
to change txhash.

v4: Removed sysctl, txhash is changed for all RTOs
v3: Removed text saying default value of sysctl is 0 (it is 100)
v2: Added sysctl documentation and cleaned code

Tested with packetdrill tests

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-28 07:52:34 -04:00
Yuchung Cheng
7e32b44361 tcp: properly account Fast Open SYN-ACK retrans
Since the TFO socket is accepted right off SYN-data, the socket
owner can call getsockopt(TCP_INFO) to collect ongoing SYN-ACK
retransmission or timeout stats (i.e., tcpi_total_retrans,
tcpi_retransmits). Currently those stats are only updated
upon handshake completes. This patch fixes it.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-22 03:33:01 -04:00
Richard Sailer
c380d37e97 tcp_timer.c: Add kernel-doc function descriptions
This adds kernel-doc style descriptions for 6 functions and
fixes 1 typo.

Signed-off-by: Richard Sailer <richard@weltraumpflege.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-15 23:18:14 -07:00
Eric Dumazet
c10d9310ed tcp: do not assume TCP code is non preemptible
We want to to make TCP stack preemptible, as draining prequeue
and backlog queues can take lot of time.

Many SNMP updates were assuming that BH (and preemption) was disabled.

Need to convert some __NET_INC_STATS() calls to NET_INC_STATS()
and some __TCP_INC_STATS() to TCP_INC_STATS()

Before using this_cpu_ptr(net->ipv4.tcp_sk) in tcp_v4_send_reset()
and tcp_v4_send_ack(), we add an explicit preempt disabled section.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02 17:02:25 -04:00
Eric Dumazet
02a1d6e7a6 net: rename NET_{ADD|INC}_STATS_BH()
Rename NET_INC_STATS_BH() to __NET_INC_STATS()
and NET_ADD_STATS_BH() to __NET_ADD_STATS()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27 22:48:24 -04:00
Eric Dumazet
10d3be5692 tcp-tso: do not split TSO packets at retransmit time
Linux TCP stack painfully segments all TSO/GSO packets before retransmits.

This was fine back in the days when TSO/GSO were emerging, with their
bugs, but we believe the dark age is over.

Keeping big packets in write queues, but also in stack traversal
has a lot of benefits.
 - Less memory overhead, because write queues have less skbs
 - Less cpu overhead at ACK processing.
 - Better SACK processing, as lot of studies mentioned how
   awful linux was at this ;)
 - Less cpu overhead to send the rtx packets
   (IP stack traversal, netfilter traversal, drivers...)
 - Better latencies in presence of losses.
 - Smaller spikes in fq like packet schedulers, as retransmits
   are not constrained by TCP Small Queues.

1 % packet losses are common today, and at 100Gbit speeds, this
translates to ~80,000 losses per second.
Losses are often correlated, and we see many retransmit events
leading to 1-MSS train of packets, at the time hosts are already
under stress.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-24 14:43:59 -04:00
Nikolay Borisov
c402d9beff ipv4: Namespaceify tcp_orphan_retries sysctl knob
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07 14:35:11 -05:00
Nikolay Borisov
c6214a97c8 ipv4: Namespaceify tcp_retries2 sysctl knob
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07 14:35:11 -05:00
Nikolay Borisov
ae5c3f406c ipv4: Namespaceify tcp_retries1 sysctl knob
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07 14:35:10 -05:00
Nikolay Borisov
7c083ecb3b ipv4: Namespaceify tcp synack retries sysctl knob
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07 14:35:10 -05:00
Nikolay Borisov
6fa2516630 ipv4: Namespaceify tcp syn retries sysctl knob
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-07 14:35:10 -05:00
Nikolay Borisov
b840d15d39 ipv4: Namespecify the tcp_keepalive_intvl sysctl knob
This is the final part required to namespaceify the tcp
keep alive mechanism.

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-10 17:32:09 -05:00
Nikolay Borisov
9bd6861bd4 ipv4: Namespecify tcp_keepalive_probes sysctl knob
This is required to have full tcp keepalive mechanism namespace
support.

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-10 17:32:09 -05:00
Nikolay Borisov
13b287e8d1 ipv4: Namespaceify tcp_keepalive_time sysctl knob
Different net namespaces might have different requirements as to
the keepalive time of tcp sockets. This might be required in cases
where different firewall rules are in place which require tcp
timeout sockets to be increased/decreased independently of the host.

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-10 17:32:09 -05:00
Yuchung Cheng
dd52bc2b4e tcp: fix Fast Open snmp over-counting bug
Fix incrementing TCPFastOpenActiveFailed snmp stats multiple times
when the handshake experiences multiple SYN timeouts.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-20 10:51:12 -05:00
Yuchung Cheng
0e45f4da59 tcp: disable Fast Open on timeouts after handshake
Some middle-boxes black-hole the data after the Fast Open handshake
(https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf).
The exact reason is unknown. The work-around is to disable Fast Open
temporarily after multiple recurring timeouts with few or no data
delivered in the established state.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-20 10:51:12 -05:00
Richard Sailer
7533ce3055 tcp: change type of alive from int to bool
The alive parameter of tcp_orphan_retries, indicates
whether the connection is assumed alive or not.
In the function and all places calling it is used as a boolean value.

Therefore this changes the type of alive to bool in the function
definition and all calling locations.

Since tcp_orphan_tries is a tcp_timer.c local function no change in
any other file or header is necessary.

Signed-off-by: Richard Sailer <richard@weltraumpflege.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-12 05:15:03 -07:00
Eric Dumazet
a4e2405cc5 tcp: do not export tcp_init_xmit_timers()
After commit 900f65d361d3 ("tcp: move duplicate code from
tcp_v4_init_sock()/tcp_v6_init_sock()"), we no longer
need to export tcp_init_xmit_timers()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-09 21:44:38 -07:00
Eric Dumazet
b8da51ebb1 tcp: introduce tcp_under_memory_pressure()
Introduce an optimized version of sk_under_memory_pressure()
for TCP. Our intent is to use it in fast paths.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-17 22:45:48 -04:00
Eric Dumazet
e520af48c7 tcp: add TCPWinProbe and TCPKeepAlive SNMP counters
Diagnosing problems related to Window Probes has been hard because
we lack a counter.

TCPWinProbe counts the number of ACK packets a sender has to send
at regular intervals to make sure a reverse ACK packet opening back
a window had not been lost.

TCPKeepAlive counts the number of ACK packets sent to keep TCP
flows alive (SO_KEEPALIVE)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-09 16:42:32 -04:00
Daniel Lee
2646c831c0 tcp: RFC7413 option support for Fast Open client
Fast Open has been using an experimental option with a magic number
(RFC6994). This patch makes the client by default use the RFC7413
option (34) to get and send Fast Open cookies.  This patch makes
the client solicit cookies from a given server first with the
RFC7413 option. If that fails to elicit a cookie, then it tries
the RFC6994 experimental option. If that also fails, it uses the
RFC7413 option on all subsequent connect attempts.  If the server
returns a Fast Open cookie then the client caches the form of the
option that successfully elicited a cookie, and uses that form on
later connects when it presents that cookie.

The idea is to gradually obsolete the use of experimental options as
the servers and clients upgrade, while keeping the interoperability
meanwhile.

Signed-off-by: Daniel Lee <Longinus00@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-07 18:36:39 -04:00
Eric Dumazet
42cb80a235 inet: remove sk_listener parameter from syn_ack_timeout()
It is not needed, and req->sk_listener points to the listener anyway.
request_sock argument can be const.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 16:52:25 -04:00