This commit is a bug fix for the Linux TCP app-limited logic
used for collecting rate (bandwidth) samples.
Previously the app-limited logic only looked for "bubbles" of
silence in between application writes, by checking at the start
of each sendmsg. But "bubbles" of silence can also happen before
retransmits: e.g. bubbles can happen between an application write
and a retransmit, or between two retransmits.
Retransmits are triggered by ACKs or timers. So this commit checks
for bubbles of app-limited silence upon ACKs or timers.
Why does this commit check for app-limited state at the start of
ACKs and timer handling? Because at that point we know whether
inflight was fully using the cwnd. While processing the ACK or
timer event we often change the cwnd; after changing the cwnd we
can no longer tell whether inflight was fully using the old cwnd.
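The check described above can be sketched roughly as follows. This is a
simplified model, not the kernel code: struct flow_state, its fields, and
check_app_limited() are hypothetical names, and the exact set of
conditions (no full segment queued, cwnd not full, no pending
retransmits) is an assumption made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified sender state; not the kernel's struct tcp_sock. */
struct flow_state {
	uint32_t unsent_bytes; /* written by the app but not yet sent */
	uint32_t mss;          /* current segment size in bytes */
	uint32_t inflight;     /* packets sent and not yet acked or lost */
	uint32_t cwnd;         /* congestion window in packets */
	uint32_t lost;         /* packets marked lost */
	uint32_t retrans;      /* lost packets already retransmitted */
	bool app_limited;      /* set when a bubble of silence is seen */
};

/*
 * Meant to run at the start of sendmsg, ACK processing, and timer
 * handling, i.e. before the event handler has a chance to change cwnd.
 */
void check_app_limited(struct flow_state *f)
{
	bool no_full_segment_queued = f->unsent_bytes < f->mss;
	bool cwnd_not_full = f->inflight < f->cwnd;
	bool no_pending_retransmits = f->lost <= f->retrans;

	if (no_full_segment_queued && cwnd_not_full && no_pending_retransmits)
		f->app_limited = true;
}
```

Calling the check before the handler touches cwnd is the point: the
comparison of inflight against cwnd is only meaningful for the old cwnd.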
Origin-9xx-SHA1: 3fe9b53291e018407780fb8c356adb5666722cbc
Change-Id: I37221506f5166877c2b110753d39bb0757985e68
(cherry-picked from 80d039a0a0)
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
* refs/heads/tmp-f960b38:
Linux 4.14.159
of: unittest: fix memory leak in attach_node_and_children
raid5: need to set STRIPE_HANDLE for batch head
gpiolib: acpi: Add Terra Pad 1061 to the run_edge_events_on_boot_blacklist
kernel/module.c: wakeup processes in module_wq on module unload
gfs2: fix glock reference problem in gfs2_trans_remove_revoke
net/mlx5e: Fix SFF 8472 eeprom length
sunrpc: fix crash when cache_head become valid before update
workqueue: Fix missing kfree(rescuer) in destroy_workqueue()
blk-mq: make sure that line break can be printed
mfd: rk808: Fix RK818 ID template
ext4: fix a bug in ext4_wait_for_tail_page_commit
mm/shmem.c: cast the type of unmap_start to u64
firmware: qcom: scm: Ensure 'a0' status code is treated as signed
ext4: work around deleting a file with i_nlink == 0 safely
powerpc: Fix vDSO clock_getres()
powerpc: Avoid clang warnings around setjmp and longjmp
ath10k: fix fw crash by moving chip reset after napi disabled
media: vimc: fix component match compare
mlxsw: spectrum_router: Refresh nexthop neighbour when it becomes dead
power: supply: cpcap-battery: Fix signed counter sample register
x86/MCE/AMD: Carve out the MC4_MISC thresholding quirk
x86/MCE/AMD: Turn off MC4_MISC thresholding on all family 0x15 models
e100: Fix passing zero to 'PTR_ERR' warning in e100_load_ucode_wait
drbd: Change drbd_request_detach_interruptible's return type to int
scsi: lpfc: Correct code setting non existent bits in sli4 ABORT WQE
scsi: lpfc: Cap NPIV vports to 256
omap: pdata-quirks: remove openpandora quirks for mmc3 and wl1251
phy: renesas: rcar-gen3-usb2: Fix sysfs interface of "role"
iio: adis16480: Add debugfs_reg_access entry
xhci: make sure interrupts are restored to correct state
xhci: Fix memory leak in xhci_add_in_port()
scsi: qla2xxx: Fix message indicating vectors used by driver
scsi: qla2xxx: Always check the qla2x00_wait_for_hba_online() return value
scsi: qla2xxx: Fix qla24xx_process_bidir_cmd()
scsi: qla2xxx: Fix session lookup in qlt_abort_work()
scsi: qla2xxx: Fix DMA unmap leak
scsi: zfcp: trace channel log even for FCP command responses
block: fix single range discard merge
reiserfs: fix extended attributes on the root directory
ext4: Fix credit estimate for final inode freeing
quota: fix livelock in dquot_writeback_dquots
ext2: check err when partial != NULL
quota: Check that quota is not dirty before release
video/hdmi: Fix AVI bar unpack
powerpc/xive: Skip ioremap() of ESB pages for LSI interrupts
powerpc: Allow flush_icache_range to work across ranges >4GB
powerpc/xive: Prevent page fault issues in the machine crash handler
powerpc: Allow 64bit VDSO __kernel_sync_dicache to work across ranges >4GB
ppdev: fix PPGETTIME/PPSETTIME ioctls
ARM: dts: omap3-tao3530: Fix incorrect MMC card detection GPIO polarity
mmc: host: omap_hsmmc: add code for special init of wl1251 to get rid of pandora_wl1251_init_card
pinctrl: samsung: Fix device node refcount leaks in S3C64xx wakeup controller init
pinctrl: samsung: Fix device node refcount leaks in init code
pinctrl: samsung: Fix device node refcount leaks in S3C24xx wakeup controller init
pinctrl: samsung: Add of_node_put() before return in error path
ACPI: PM: Avoid attaching ACPI PM domain to certain devices
ACPI: bus: Fix NULL pointer check in acpi_bus_get_private_data()
ACPI: OSL: only free map once in osl.c
cpufreq: powernv: fix stack bloat and hard limit on number of CPUs
PM / devfreq: Lock devfreq in trans_stat_show
intel_th: pci: Add Tiger Lake CPU support
intel_th: pci: Add Ice Lake CPU support
intel_th: Fix a double put_device() in error path
cpuidle: Do not unset the driver if it is there already
media: cec.h: CEC_OP_REC_FLAG_ values were swapped
media: radio: wl1273: fix interrupt masking on release
media: bdisp: fix memleak on release
s390/mm: properly clear _PAGE_NOEXEC bit when it is not supported
ar5523: check NULL before memcpy() in ar5523_cmd()
cgroup: pids: use atomic64_t for pids->limit
blk-mq: avoid sysfs buffer overflow with too many CPU cores
ASoC: Jack: Fix NULL pointer dereference in snd_soc_jack_report
workqueue: Fix pwq ref leak in rescuer_thread()
workqueue: Fix spurious sanity check failures in destroy_workqueue()
dm zoned: reduce overhead of backing device checks
hwrng: omap - Fix RNG wait loop timeout
watchdog: aspeed: Fix clock behaviour for ast2600
md/raid0: Fix an error message in raid0_make_request()
ALSA: hda - Fix pending unsol events at shutdown
ovl: relax WARN_ON() on rename to self
lib: raid6: fix awk build warnings
rtlwifi: rtl8192de: Fix missing enable interrupt flag
rtlwifi: rtl8192de: Fix missing callback that tests for hw release of buffer
rtlwifi: rtl8192de: Fix missing code to retrieve RX buffer address
btrfs: record all roots for rename exchange on a subvol
Btrfs: send, skip backreference walking for extents with many references
btrfs: Remove btrfs_bio::flags member
Btrfs: fix negative subv_writers counter and data space leak after buffered write
btrfs: use refcount_inc_not_zero in kill_all_nodes
btrfs: check page->mapping when loading free space cache
usb: dwc3: ep0: Clear started flag on completion
virtio-balloon: fix managed page counts when migrating pages between zones
mtd: spear_smi: Fix Write Burst mode
tpm: add check after commands attribs tab allocation
usb: mon: Fix a deadlock in usbmon between mmap and read
usb: core: urb: fix URB structure initialization function
USB: adutux: fix interface sanity check
USB: serial: io_edgeport: fix epic endpoint lookup
USB: idmouse: fix interface sanity checks
USB: atm: ueagle-atm: add missing endpoint check
iio: humidity: hdc100x: fix IIO_HUMIDITYRELATIVE channel reporting
ARM: dts: pandora-common: define wl1251 as child node of mmc3
xhci: handle some XHCI_TRUST_TX_LENGTH quirks cases as default behaviour.
xhci: Increase STS_HALT timeout in xhci_suspend()
usb: xhci: only set D3hot for pci device
staging: gigaset: add endpoint-type sanity check
staging: gigaset: fix illegal free on probe errors
staging: gigaset: fix general protection fault on probe
staging: rtl8712: fix interface sanity check
staging: rtl8188eu: fix interface sanity check
usb: Allow USB device to be warm reset in suspended state
USB: documentation: flags on usb-storage versus UAS
USB: uas: heed CAPACITY_HEURISTICS
USB: uas: honor flag to avoid CAPACITY16
media: venus: remove invalid compat_ioctl32 handler
scsi: qla2xxx: Fix driver unload hang
usb: gadget: pch_udc: fix use after free
usb: gadget: configfs: Fix missing spin_lock_init()
appletalk: Set error code if register_snap_client failed
appletalk: Fix potential NULL pointer dereference in unregister_snap_client
KVM: x86: fix out-of-bounds write in KVM_GET_EMULATED_CPUID (CVE-2019-19332)
ASoC: rsnd: fixup MIX kctrl registration
binder: Handle start==NULL in binder_update_page_range()
thermal: Fix deadlock in thermal thermal_zone_device_check
iomap: Fix pipe page leakage during splicing
RDMA/qib: Validate ->show()/store() callbacks before calling them
spi: atmel: Fix CS high support
crypto: user - fix memory leak in crypto_report
crypto: ecdh - fix big endian bug in ECC library
crypto: ccp - fix uninitialized list head
crypto: af_alg - cast ki_complete ternary op to int
crypto: crypto4xx - fix double-free in crypto4xx_destroy_sdr
KVM: x86: fix presentation of TSX feature in ARCH_CAPABILITIES
KVM: x86: do not modify masked bits of shared MSRs
KVM: arm/arm64: vgic: Don't rely on the wrong pending table
drm/i810: Prevent underflow in ioctl
jbd2: Fix possible overflow in jbd2_log_space_left()
kernfs: fix ino wrap-around detection
can: slcan: Fix use-after-free Read in slcan_open
tty: vt: keyboard: reject invalid keycodes
CIFS: Fix SMB2 oplock break processing
CIFS: Fix NULL-pointer dereference in smb2_push_mandatory_locks
x86/PCI: Avoid AMD FCH XHCI USB PME# from D0 defect
Input: Fix memory leak in psxpad_spi_probe
coresight: etm4x: Fix input validation for sysfs.
Input: goodix - add upside-down quirk for Teclast X89 tablet
Input: synaptics-rmi4 - don't increment rmiaddr for SMBus transfers
Input: synaptics-rmi4 - re-enable IRQs in f34v7_do_reflash
Input: synaptics - switch another X1 Carbon 6 to RMI/SMbus
ALSA: hda - Add mute led support for HP ProBook 645 G4
ALSA: pcm: oss: Avoid potential buffer overflows
ALSA: hda/realtek - Dell headphone has noise on unmute for ALC236
fuse: verify attributes
fuse: verify nlink
sched/fair: Scale bandwidth quota and period without losing quota/period ratio precision
tcp: exit if nothing to retransmit on RTO timeout
net: aquantia: fix RSS table and key sizes
media: vimc: fix start stream when link is disabled
ARM: dts: sunxi: Fix PMU compatible strings
usb: mtu3: fix dbginfo in qmu_tx_zlp_error_handler
mlx4: Use snprintf instead of complicated strcpy
IB/hfi1: Close VNIC sdma_progress sleep window
IB/hfi1: Ignore LNI errors before DC8051 transitions to Polling state
mlxsw: spectrum_router: Relax GRE decap matching check
firmware: qcom: scm: fix compilation error when disabled
media: stkwebcam: Bugfix for wrong return values
tty: Don't block on IO when ldisc change is pending
nfsd: Return EPERM, not EACCES, in some SETATTR cases
MIPS: OCTEON: cvmx_pko_mem_debug8: use oldest forward compatible definition
clk: renesas: r8a77995: Correct parent clock of DU
powerpc/math-emu: Update macros from GCC
pstore/ram: Avoid NULL deref in ftrace merging failure path
net/mlx4_core: Fix return codes of unsupported operations
dlm: fix invalid cluster name warning
ARM: dts: realview: Fix some more duplicate regulator nodes
clk: sunxi-ng: h3/h5: Fix CSI_MCLK parent
ARM: dts: pxa: clean up USB controller nodes
mtd: fix mtd_oobavail() incoherent returned value
kbuild: fix single target build for external module
modpost: skip ELF local symbols during section mismatch check
tcp: fix SNMP TCP timeout under-estimation
tcp: fix SNMP under-estimation on failed retransmission
tcp: fix off-by-one bug on aborting window-probing socket
ARM: dts: realview-pbx: Fix duplicate regulator nodes
ARM: dts: mmp2: fix the gpio interrupt cell number
net/x25: fix null_x25_address handling
net/x25: fix called/calling length calculation in x25_parse_address_block
arm64: dts: meson-gxl-khadas-vim: fix GPIO lines names
arm64: dts: meson-gxbb-odroidc2: fix GPIO lines names
arm64: dts: meson-gxbb-nanopi-k2: fix GPIO lines names
arm64: dts: meson-gxl-libretech-cc: fix GPIO lines names
ARM: OMAP1/2: fix SoC name printing
ASoC: au8540: use 64-bit arithmetic instead of 32-bit
nfsd: fix a warning in __cld_pipe_upcall()
ARM: debug: enable UART1 for socfpga Cyclone5
dlm: NULL check before kmem_cache_destroy is not needed
ARM: dts: sun8i: v3s: Change pinctrl nodes to avoid warning
ARM: dts: sun5i: a10s: Fix HDMI output DTC warning
ASoC: rsnd: tidyup registering method for rsnd_kctrl_new()
lockd: fix decoding of TEST results
i2c: imx: don't print error message on probe defer
serial: imx: fix error handling in console_setup
altera-stapl: check for a null key before strcasecmp'ing it
dma-mapping: fix return type of dma_set_max_seg_size()
sparc: Correct ctx->saw_frame_pointer logic.
f2fs: fix to allow node segment for GC by ioctl path
ARM: dts: rockchip: Assign the proper GPIO clocks for rv1108
ARM: dts: rockchip: Fix the PMU interrupt number for rv1108
f2fs: change segment to section in f2fs_ioc_gc_range
f2fs: fix count of seg_freed to make sec_freed correct
ACPI: fix acpi_find_child_device() invocation in acpi_preset_companion()
usb: dwc3: don't log probe deferrals; but do log other error codes
usb: dwc3: debugfs: Properly print/set link state for HS
dmaengine: dw-dmac: implement dma protection control setting
dmaengine: coh901318: Remove unused variable
dmaengine: coh901318: Fix a double-lock bug
media: cec: report Vendor ID after initialization
media: pulse8-cec: return 0 when invalidating the logical address
ARM: dts: exynos: Use Samsung SoC specific compatible for DWC2 module
rtc: dt-binding: abx80x: fix resistance scale
rtc: max8997: Fix the returned value in case of error in 'max8997_rtc_read_alarm()'
math-emu/soft-fp.h: (_FP_ROUND_ZERO) cast 0 to void to fix warning
net/smc: use after free fix in smc_wr_tx_put_slot()
MIPS: OCTEON: octeon-platform: fix typing
iomap: sub-block dio needs to zeroout beyond EOF
net-next/hinic:fix a bug in set mac address
regulator: Fix return value of _set_load() stub
clk: rockchip: fix ID of 8ch clock of I2S1 for rk3328
clk: rockchip: fix I2S1 clock gate register for rk3328
mm/vmstat.c: fix NUMA statistics updates
Staging: iio: adt7316: Fix i2c data reading, set the data field
pinctrl: qcom: ssbi-gpio: fix gpio-hog related boot issues
crypto: bcm - fix normal/non key hash algorithm failure
crypto: ecc - check for invalid values in the key verification test
scsi: zfcp: drop default switch case which might paper over missing case
net: dsa: mv88e6xxx: Work around mv886e6161 SERDES missing MII_PHYSID2
MIPS: SiByte: Enable ZONE_DMA32 for LittleSur
dlm: fix missing idr_destroy for recover_idr
ARM: dts: rockchip: Fix rk3288-rock2 vcc_flash name
clk: rockchip: fix rk3188 sclk_mac_lbtest parameter ordering
clk: rockchip: fix rk3188 sclk_smc gate data
i40e: don't restart nway if autoneg not supported
rtc: s3c-rtc: Avoid using broken ALMYEAR register
net: ethernet: ti: cpts: correct debug for expired txq skb
extcon: max8997: Fix lack of path setting in USB device mode
dlm: fix possible call to kfree() for non-initialized pointer
clk: sunxi-ng: a64: Fix gate bit of DSI DPHY
net/mlx5: Release resource on error flow
ARM: 8813/1: Make aligned 2-byte getuser()/putuser() atomic on ARMv6+
iwlwifi: mvm: Send non offchannel traffic via AP sta
iwlwifi: mvm: synchronize TID queue removal
cxgb4vf: fix memleak in mac_hlist initialization
serial: core: Allow processing sysrq at port unlock time
i2c: core: fix use after free in of_i2c_notify
net: ep93xx_eth: fix mismatch of request_mem_region in remove
rsxx: add missed destroy_workqueue calls in remove
ALSA: pcm: Fix stream lock usage in snd_pcm_period_elapsed()
sched/core: Avoid spurious lock dependencies
Input: cyttsp4_core - fix use after free bug
xfrm: release device reference for invalid state
NFC: nxp-nci: Fix NULL pointer dereference after I2C communication error
audit_get_nd(): don't unlock parent too early
exportfs_decode_fh(): negative pinned may become positive without the parent locked
iwlwifi: pcie: don't consider IV len in A-MSDU
RDMA/hns: Correct the value of HNS_ROCE_HEM_CHUNK_LEN
autofs: fix a leak in autofs_expire_indirect()
serial: ifx6x60: add missed pm_runtime_disable
serial: serial_core: Perform NULL checks for break_ctl ops
serial: pl011: Fix DMA ->flush_buffer()
tty: serial: msm_serial: Fix flow control
tty: serial: fsl_lpuart: use the sg count from dma_map_sg
usb: gadget: u_serial: add missing port entry locking
arm64: tegra: Fix 'active-low' warning for Jetson TX1 regulator
rsi: release skb if rsi_prepare_beacon fails
ANDROID: staging: android: ion: Fix build when CONFIG_ION_SYSTEM_HEAP=n
ANDROID: staging: android: ion: Expose total heap and pool sizes via sysfs
UPSTREAM: include/linux/slab.h: fix sparse warning in kmalloc_type()
UPSTREAM: mm, slab: shorten kmalloc cache names for large sizes
UPSTREAM: mm, proc: add KReclaimable to /proc/meminfo
BACKPORT: mm: rename and change semantics of nr_indirectly_reclaimable_bytes
UPSTREAM: dcache: allocate external names from reclaimable kmalloc caches
BACKPORT: mm, slab/slub: introduce kmalloc-reclaimable caches
UPSTREAM: mm, slab: combine kmalloc_caches and kmalloc_dma_caches
ANDROID: kbuild: disable SCS by default in allmodconfig
ANDROID: arm64: cuttlefish_defconfig: enable LTO, CFI, and SCS
BACKPORT: FROMLIST: arm64: implement Shadow Call Stack
FROMLIST: arm64: disable SCS for hypervisor code
BACKPORT: FROMLIST: arm64: vdso: disable Shadow Call Stack
FROMLIST: arm64: preserve x18 when CPU is suspended
FROMLIST: arm64: reserve x18 from general allocation with SCS
FROMLIST: arm64: disable function graph tracing with SCS
FROMLIST: scs: add support for stack usage debugging
FROMLIST: scs: add accounting
FROMLIST: add support for Clang's Shadow Call Stack (SCS)
FROMLIST: arm64: kernel: avoid x18 in __cpu_soft_restart
FROMLIST: arm64: kvm: stop treating register x18 as caller save
FROMLIST: arm64/lib: copy_page: avoid x18 register in assembler code
FROMLIST: arm64: mm: avoid x18 in idmap_kpti_install_ng_mappings
ANDROID: use non-canonical CFI jump tables
ANDROID: arm64: add __nocfi to __apply_alternatives
ANDROID: arm64: add __pa_function
ANDROID: arm64: allow ThinLTO to be selected
ANDROID: soc/tegra: disable ARCH_TEGRA_210_SOC with LTO
FROMLIST: arm64: fix alternatives with LLVM's integrated assembler
ANDROID: irqchip/gic-v3: rename gic_of_init to work around a ThinLTO+CFI bug
ANDROID: kbuild: limit LTO inlining
ANDROID: kbuild: merge module sections with LTO
ANDROID: init: ensure initcall ordering with LTO
Revert "ANDROID: HACK: init: ensure initcall ordering with LTO"
ANDROID: add support for ThinLTO
ANDROID: Switch to LLD
ANDROID: clang: update to 10.0.1
ANDROID: arm64: add atomic_ll_sc.o to obj-y if using lld
ANDROID: enable ARM64_ERRATUM_843419 by default with LTO_CLANG
ANDROID: kbuild: allow lld to be used with CONFIG_LTO_CLANG
ANDROID: Makefile: set -Qunused-arguments sooner
BACKPORT: FROMLIST: Makefile: lld: tell clang to use lld
BACKPORT: FROMLIST: Makefile: lld: set -O2 linker flag when linking with LLD
ANDROID: scripts/Kbuild: add ld-name support for ld.lld
UPSTREAM: bpf: permit multiple bpf attachments for a single perf event
UPSTREAM: bpf: use the same condition in perf event set/free bpf handler
UPSTREAM: bpf: multi program support for cgroup+bpf
BACKPORT: serdev: make synchronous write return bytes written
UPSTREAM: gnss: serial: fix synchronous write timeout
UPSTREAM: gnss: fix potential error pointer dereference
BACKPORT: gnss: add receiver type support
UPSTREAM: dt-bindings: add generic gnss binding
UPSTREAM: gnss: add generic serial driver
ANDROID: cuttlefish_defconfig: Enable CONFIG_SERIAL_DEV_BUS
ANDROID: cuttlefish_defconfig: Enable CONFIG_GNSS
BACKPORT: gnss: add GNSS receiver subsystem
UPSTREAM: arm64: Validate tagged addresses in access_ok() called from kernel threads
BACKPORT: ARM: 8905/1: Emit __gnu_mcount_nc when using Clang 10.0.0 or newer
fs/lock: skip lock owner pid translation in case we are in init_pid_ns
f2fs: stop GC when the victim becomes fully valid
f2fs: expose main_blkaddr in sysfs
f2fs: choose hardlimit when softlimit is larger than hardlimit in f2fs_statfs_project()
f2fs: Fix deadlock in f2fs_gc() context during atomic files handling
f2fs: show f2fs instance in printk_ratelimited
f2fs: fix potential overflow
f2fs: fix to update dir's i_pino during cross_rename
f2fs: support aligned pinned file
f2fs: avoid kernel panic on corruption test
f2fs: fix wrong description in document
f2fs: cache global IPU bio
f2fs: fix to avoid memory leakage in f2fs_listxattr
f2fs: check total_segments from devices in raw_super
f2fs: update multi-dev metadata in resize_fs
f2fs: mark recovery flag correctly in read_raw_super_block()
f2fs: fix to update time in lazytime mode
vfs: don't allow writes to swap files
mm: set S_SWAPFILE on blockdev swap devices
Conflicts:
drivers/Makefile
drivers/staging/android/ion/ion.c
drivers/staging/android/ion/ion.h
drivers/staging/android/ion/ion_page_pool.c
drivers/usb/dwc3/core.c
drivers/usb/dwc3/debugfs.c
drivers/usb/dwc3/ep0.c
fs/f2fs/data.c
include/linux/mmzone.h
mm/vmstat.c
Discarded the patches below: the usb patches are not applicable, and
the block patch causes stability issues:
usb: dwc3: ep0: Clear started flag on completion
usb: dwc3: don't log probe deferrals; but do log other error codes
block: fix single range discard merge
Fixed build errors in below files:
drivers/gpu/msm/kgsl_pool.c
drivers/staging/android/ion/ion_page_pool.c
kernel/taskstats.c
Fixed bootup issue in:
arch/arm64/mm/proc.s
Change-Id: I0a16824c251c14c63af78f9cfd9ede5e82c427fc
Signed-off-by: Srinivasarao P <spathi@codeaurora.org>
Signed-off-by: Blagovest Kolenichev <bkolenichev@codeaurora.org>
Two upstream commits squashed together for v4.14 stable :
commit 88f8598d0a302a08380eadefd09b9f5cb1c4c428 upstream.
Previously TCP only warned if its RTO timer fired while the
retransmission queue was empty, but this led to a NULL pointer
dereference later on. It is better to avoid such a catastrophic
failure and simply exit with a warning.
Squashed with "tcp: refactor tcp_retransmit_timer()" :
commit 0d580fbd2db084a5c96ee9c00492236a279d5e0f upstream.
It appears linux-4.14 stable needs a backport of commit
88f8598d0a30 ("tcp: exit if nothing to retransmit on RTO timeout").
Since tcp_rtx_queue_empty() is not in pre-4.15 kernels,
let's refactor tcp_retransmit_timer() to only use tcp_rtx_queue_head().
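The early-exit pattern described here can be sketched in a standalone
form. This is not the kernel's tcp_retransmit_timer(); struct segment,
struct rtx_queue, and on_rto_timeout() are hypothetical names used only
to show the idea of fetching the queue head once and bailing out with a
warning when it is missing.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for a retransmit queue and its entries. */
struct segment {
	struct segment *next;
	unsigned int seq;
};

struct rtx_queue {
	struct segment *head;
};

/*
 * Returns 1 if a retransmit was attempted, 0 if the handler exited
 * early because there was nothing to retransmit.
 */
int on_rto_timeout(const struct rtx_queue *q, unsigned int *retransmitted_seq)
{
	const struct segment *head = q->head;

	if (head == NULL) {
		/* Warn and exit instead of dereferencing NULL later. */
		fprintf(stderr, "warning: RTO fired with empty retransmit queue\n");
		return 0;
	}

	*retransmitted_seq = head->seq;
	return 1;
}
```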
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit e1561fe2dd69dc5dddd69bd73aa65355bdfb048b ]
Previously the SNMP TCPTIMEOUTS counter had inconsistent accounting:
1. It counted all SYN and SYN-ACK timeouts.
2. It counted timeouts in other states, except recurring timeouts and
timeouts after fast recovery or disorder state.
Such selective accounting makes analysis difficult and complicated. For
example, the monitoring system needs to collect many other SNMP counters
to infer the total number of timeout events. This patch makes the
TCPTIMEOUTS counter simply count all retransmit timeouts (SYN, data, or FIN).
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 3976535af0cb9fe34a55f2ffb8d7e6b39a2f8188 ]
Previously there was an off-by-one bug in determining when to abort
a stalled window-probing socket. This patch fixes that so it is
consistent with tcp_write_timeout().
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 967c05aee439e6e5d7d805e195b3a20ef5c433d6 upstream.
If mtu probing is enabled tcp_mtu_probing() could very well end up
with a too small MSS.
Use the new sysctl tcp_min_snd_mss to make sure MSS search
is performed in an acceptable range.
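As a rough illustration of the clamp, the sketch below lowers a probe
MSS and then applies a floor. next_probe_mss() and its parameters are
hypothetical, and the halving step and the base_mss ceiling are
assumptions made for the example; only the idea of bounding the result
by a tcp_min_snd_mss style floor comes from the description above.

```c
/*
 * Hypothetical helper: pick the MSS for the next probe.
 * The search halves the current MSS (assumed policy), caps it at
 * base_mss, and never goes below min_snd_mss.
 */
unsigned int next_probe_mss(unsigned int current_mss,
			    unsigned int base_mss,
			    unsigned int min_snd_mss)
{
	unsigned int mss = current_mss / 2;

	if (mss > base_mss)
		mss = base_mss;
	if (mss < min_snd_mss)
		mss = min_snd_mss;

	return mss;
}
```

Without the final floor, repeated halving could drive the MSS toward a
value so small that almost every packet is header overhead.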
CVE-2019-11479 -- tcp mss hardcoded to 48
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: Jonathan Looney <jtl@netflix.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Tyler Hicks <tyhicks@canonical.com>
Cc: Bruce Curtis <brucec@netflix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
It has been observed that the default values of some key TCP/IP
parameters affect the throughput and performance of the system.
Hence, extend the configuration capabilities of the TCP/IP stack
through the sysctl interface.
Change-Id: I4287e9103769535f43e0934bac08435a524ee6a4
CRs-Fixed: 507581
Signed-off-by: Ravi Joshi <ravij@codeaurora.org>
Signed-off-by: Ganesh Babu Kumaravel <kganesh@codeaurora.org>
Signed-off-by: Mohit Khanna <mkhannaqca@codeaurora.org>
Signed-off-by: Manjunathappa Prakash <prakashpm@codeaurora.org>
[ Upstream commit e05836ac07c77dd90377f8c8140bce2a44af5fe7 ]
When the connection is aborted, there is no point in
keeping the packets on the write queue until the connection
is closed.
Similar to a27fd7a8ed38 ('tcp: purge write queue upon RST'),
this is essential for a correct MSG_ZEROCOPY implementation,
because userspace cannot call close(fd) before receiving
zerocopy signals even when the connection is aborted.
Fixes: f214f915e7db ("tcp: enable MSG_ZEROCOPY")
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 4ee806d51176ba7b8ff1efd81f271d7252e03a1d ]
When a tcp socket is closed, if it detects that its net namespace is
exiting, close immediately and do not wait for FIN sequence.
For normal sockets, a reference is taken to their net namespace, so it will
never exit while the socket is open. However, kernel sockets do not take a
reference to their net namespace, so it may begin exiting while the kernel
socket is still open. In this case if the kernel socket is a tcp socket,
it will stay open trying to complete its close sequence. The sock's dst(s)
hold a reference to their interface, which are all transferred to the
namespace's loopback interface when the real interfaces are taken down.
When the namespace tries to take down its loopback interface, it hangs
waiting for all references to the loopback interface to release, which
results in messages like:
unregister_netdevice: waiting for lo to become free. Usage count = 1
These messages continue until the socket finally times out and closes.
Since the net namespace cleanup holds the net_mutex while calling its
registered pernet callbacks, any new net namespace initialization is
blocked until the current net namespace finishes exiting.
After this change, the tcp socket notices the exiting net namespace, and
closes immediately, releasing its dst(s) and their reference to the
loopback interface, which lets the net namespace continue exiting.
Link: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=97811
Signed-off-by: Dan Streetman <ddstreet@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 4688eb7cf3ae2c2721d1dacff5c1384cba47d176 ]
Only the retransmit timer currently refreshes tcp_mstamp.
We should do the same for delayed acks and keepalives.
Even if RFC 7323 does not request it, this is consistent with what Linux
did in the past, when TS values were based on jiffies.
Fixes: 385e20706fac ("tcp: use tp->tcp_mstamp in output path")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Mike Maloney <maloney@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Mike Maloney <maloney@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The UDP offload conflict is dealt with by simply taking what is
in net-next where we have removed all of the UFO handling code
entirely.
The TCP conflict was a case of local variables in a function
being removed from both net and net-next.
In netvsc we had an assignment right next to where a missing
set of u64 stats sync object inits were added.
Signed-off-by: David S. Miller <davem@davemloft.net>
prequeue is a tcp receive optimization that moves part of rx processing
from bh to process context.
This only works if the socket being processed belongs to a process that
is blocked in recv on that socket.
In practice, this rarely happens anymore, because nowadays
servers tend to use an event driven (epoll) model.
Even normal client applications (web browsers) commonly use many tcp
connections in parallel.
This has measurable impact only in netperf (which uses plain recv and
thus allows prequeue use) from host to locally running vm (~4%); however,
there were no changes when using netperf between two physical hosts with
ixgbe interfaces.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the mentioned commit, some of our packetdrill tests became flaky.
The TCP_SYNCNT socket option can limit the number of SYN retransmits.
retransmits_timed_out() has to compare time computations based on
local_clock(), while timers are based on jiffies. With NTP adjustments
and rounding, we can observe a 999 ms delay for a 1000 ms timer, so
we end up sending one extra SYN packet.
The gimmick added in commit 6fa12c850314 ("Revert Backoff [v3]: Calculate
TCP's connection close threshold as a time value") makes no
real sense for TCP_SYN_SENT sockets, where no RTO backoff can happen at
all.
Let's use simpler logic for TCP_SYN_SENT sockets and remove the @syn_set
parameter from retransmits_timed_out().
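The difference between the two checks can be sketched as below. Both
functions are hypothetical models, not kernel code: deadline_expired()
stands for a time-based comparison that is sensitive to clock skew, and
syn_retries_exhausted() stands for a plain count comparison, which is an
assumption about what the simpler logic looks like.

```c
#include <stdbool.h>

/* Time-based check: a 999 ms reading against a 1000 ms limit fails. */
bool deadline_expired(unsigned int elapsed_ms, unsigned int timeout_ms)
{
	return elapsed_ms >= timeout_ms;
}

/* Count-based check: independent of which clock measured the delay. */
bool syn_retries_exhausted(unsigned int retransmits, unsigned int retry_limit)
{
	return retransmits >= retry_limit;
}
```

With a limit of one SYN retransmit, the count-based check stops after
the first retransmit even if the two clocks disagree by a millisecond,
whereas the time-based check at 999 ms would allow one more SYN.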
Fixes: 9a568de4818d ("tcp: switch TCP TS option (RFC 7323) to 1ms clock")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP_USER_TIMEOUT is still converted to a jiffies value in
icsk_user_timeout, so we need to make a conversion for the
cases where HZ != 1000.
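The arithmetic behind the conversion can be shown with a small
standalone model. ticks_to_ms() is a hypothetical helper written for
this example; the kernel has its own jiffies_to_msecs() helper, which
this sketch does not try to reproduce.

```c
#include <stdint.h>

/*
 * Convert a tick count to milliseconds for a given tick rate (hz).
 * With hz == 1000 the two units coincide; otherwise they do not,
 * so a value stored in ticks cannot be compared to a ms clock as-is.
 */
uint32_t ticks_to_ms(uint32_t ticks, uint32_t hz)
{
	return (uint32_t)((uint64_t)ticks * 1000u / hz);
}
```

For example, a 1000 ms timeout is stored as 250 ticks at HZ=250 and as
100 ticks at HZ=100, so comparing the raw tick count against a
millisecond clock would expire the timeout far too early.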
Fixes: 9a568de4818d ("tcp: switch TCP TS option (RFC 7323) to 1ms clock")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The TCP Timestamps option is defined in RFC 7323.
Traditionally on Linux, it has been tied to the internal
'jiffies' variable, because it had been a cheap and good enough
generator.
For TCP flows on the Internet, 1 ms resolution would be much better
than 4 ms or 10 ms (HZ=250 or HZ=100 respectively).
For TCP flows in the DC, Google has used usec resolution for more
than two years with great success [1].
Receive size autotuning (DRS) is indeed more precise and converges
faster to optimal window size.
This patch converts tp->tcp_mstamp to a plain u64 value storing
a 1 usec TCP clock.
This choice will allow us to upstream the 1 usec TS option as
discussed in IETF 97.
[1] https://www.ietf.org/proceedings/97/slides/slides-97-tcpm-tcp-options-for-low-latency-00.pdf
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use tcp_jiffies32 instead of tcp_time_stamp, since
tcp_time_stamp will soon be only used for TCP TS option.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use tcp_jiffies32 instead of tcp_time_stamp, since
tcp_time_stamp will soon be only used for TCP TS option.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use tcp_jiffies32 instead of tcp_time_stamp to feed
tp->lsndtime.
tcp_time_stamp will soon be a little bit more expensive
than simply reading 'jiffies'.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Idea is to later convert tp->tcp_mstamp to a full u64 counter
using usec resolution, so that we can later have fine
grained TCP TS clock (RFC 7323), regardless of HZ value.
We try to refresh tp->tcp_mstamp only when necessary.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
BBR congestion control depends on pacing, and pacing is
currently handled by sch_fq packet scheduler for performance reasons,
and also because implementing pacing with FQ was convenient to truly
avoid bursts.
However there are many cases where this packet scheduler constraint
is not practical.
- Many Linux hosts are not focusing on handling thousands of TCP
flows in the most efficient way.
- Some routers use fq_codel or other AQM, but still would like
to use BBR for the few TCP flows they initiate/terminate.
This patch implements an automatic fallback to internal pacing.
Pacing is requested either by BBR or use of SO_MAX_PACING_RATE option.
If sch_fq happens to be in the egress path, pacing is delegated to
the qdisc, otherwise pacing is done by TCP itself.
One advantage of pacing from the TCP stack is more precise RTT
estimation, and less work done at TX completion, since TCP Small
Queues limits are not generally hit. Setups with a single TX queue
but many CPUs might even benefit from this.
Note that unlike sch_fq, we do not take into account header sizes.
Taking care of these headers would add additional complexity for
no practical differences in behavior.
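A minimal sketch of the pacing arithmetic implied above, assuming the delay between transmissions is simply payload length divided by the pacing rate. The helper name is hypothetical; header sizes are ignored, as the note above says.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical helper: time to wait after sending payload_len bytes
 * so that the flow does not exceed rate_bytes_per_sec.
 * Only the payload is counted; header sizes are not taken into account.
 */
static inline uint64_t pacing_delay_ns(uint32_t payload_len,
				       uint64_t rate_bytes_per_sec)
{
	assert(rate_bytes_per_sec > 0);
	return (uint64_t)payload_len * NSEC_PER_SEC / rate_bytes_per_sec;
}
```

For a flow limited to 48 Mbit per second (6000000 bytes per second), a 1500 byte payload gives a delay of 250000 ns.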
Some performance numbers using 800 TCP_STREAM flows rate limited to
~48 Mbit per second on 40Gbit NIC.
If MQ+pfifo_fast is used on the NIC :
$ sar -n DEV 1 5 | grep eth
14:48:44 eth0 725743.00 2932134.00 46776.76 4335184.68 0.00 0.00 1.00
14:48:45 eth0 725349.00 2932112.00 46751.86 4335158.90 0.00 0.00 0.00
14:48:46 eth0 725101.00 2931153.00 46735.07 4333748.63 0.00 0.00 0.00
14:48:47 eth0 725099.00 2931161.00 46735.11 4333760.44 0.00 0.00 1.00
14:48:48 eth0 725160.00 2931731.00 46738.88 4334606.07 0.00 0.00 0.00
Average: eth0 725290.40 2931658.20 46747.54 4334491.74 0.00 0.00 0.40
$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
4 0 0 259825920 45644 2708324 0 0 21 2 247 98 0 0 100 0 0
4 0 0 259823744 45644 2708356 0 0 0 0 2400825 159843 0 19 81 0 0
0 0 0 259824208 45644 2708072 0 0 0 0 2407351 159929 0 19 81 0 0
1 0 0 259824592 45644 2708128 0 0 0 0 2405183 160386 0 19 80 0 0
1 0 0 259824272 45644 2707868 0 0 0 32 2396361 158037 0 19 81 0 0
Now use MQ+FQ :
lpaa23:~# echo fq >/proc/sys/net/core/default_qdisc
lpaa23:~# tc qdisc replace dev eth0 root mq
$ sar -n DEV 1 5 | grep eth
14:49:57 eth0 678614.00 2727930.00 43739.13 4033279.14 0.00 0.00 0.00
14:49:58 eth0 677620.00 2723971.00 43674.69 4027429.62 0.00 0.00 1.00
14:49:59 eth0 676396.00 2719050.00 43596.83 4020125.02 0.00 0.00 0.00
14:50:00 eth0 675197.00 2714173.00 43518.62 4012938.90 0.00 0.00 1.00
14:50:01 eth0 676388.00 2719063.00 43595.47 4020171.64 0.00 0.00 0.00
Average: eth0 676843.00 2720837.40 43624.95 4022788.86 0.00 0.00 0.40
$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 0 259832240 46008 2710912 0 0 21 2 223 192 0 1 99 0 0
1 0 0 259832896 46008 2710744 0 0 0 0 1702206 198078 0 17 82 0 0
0 0 0 259830272 46008 2710596 0 0 0 0 1696340 197756 1 17 83 0 0
4 0 0 259829168 46024 2710584 0 0 16 0 1688472 197158 1 17 82 0 0
3 0 0 259830224 46024 2710408 0 0 0 0 1692450 197212 0 18 82 0 0
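As expected, the number of interrupts per second is very different:
roughly 2.4 million with MQ+pfifo_fast versus roughly 1.7 million
with MQ+FQ (the 'in' column of the vmstat output above).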
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Van Jacobson <vanj@google.com>
Cc: Jerry Chu <hkchu@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Christoph Paasch from Apple found another firewall issue for TFO:
After a successful 3WHS using TFO, the server and client start to
exchange data. Afterwards, a 10s idle time occurs on this connection.
After that, the firewall starts to drop every packet on this connection.
The fix for this issue is to extend existing firewall blackhole detection
logic in tcp_write_timeout() by removing the mss check.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dmitry Vyukov reported a divide by 0 triggered by syzkaller, exploiting
tcp_disconnect() path that was never really considered and/or used
before syzkaller ;)
I was not able to reproduce the bug, but it seems issues here are the
three possible actions that assumed they would never trigger on a
listener.
1) tcp_write_timer_handler
2) tcp_delack_timer_handler
3) MTU reduction
Only IPv6 MTU reduction was properly testing TCP_CLOSE and TCP_LISTEN
states from tcp_v6_mtu_reduced()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes the support of RFC5827 early retransmit (i.e.,
fast recovery on small inflight with <3 dupacks) because it is
subsumed by the new RACK loss detection. More specifically when
RACK receives DUPACKs, it'll arm a reordering timer to start fast
recovery after a quarter of (min)RTT, hence it covers the early
retransmit except RACK does not limit itself to specific inflight
or dupack numbers.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch makes RACK install a reordering timer when it suspects
some packets might be lost, but wants to delay the decision
a little bit to accommodate reordering.
It does not create a new timer but instead repurposes the existing
RTO timer, because both are meant to retransmit packets.
Specifically it arms a timer ICSK_TIME_REO_TIMEOUT when
the RACK timing check fails. The wait time is set to
RACK.RTT + RACK.reo_wnd - (NOW - Packet.xmit_time) + fudge
This translates to expecting that a packet (Packet) takes at most
(RACK.RTT + RACK.reo_wnd + fudge) to be delivered after it was sent.
When there are multiple packets that need a timer, we use one timer
with the maximum timeout. Therefore the timer conservatively uses
the maximum window to expire N packets by one timeout, instead of
N timeouts to expire N packets sent at different times.
The fudge factor is 2 jiffies to ensure when the timer fires, all
the suspected packets would exceed the deadline and be marked lost
by tcp_rack_detect_loss(). It has to be at least 1 jiffy because the
clock may tick between calling icsk_reset_xmit_timer(timeout) and
actually hanging the timer. The second jiffy is to lower-bound the
timeout to 2 jiffies when reo_wnd is < 1ms.
When the reordering timer fires (tcp_rack_reo_timeout): If we aren't
in Recovery we'll enter fast recovery and force fast retransmit.
This is very similar to the early retransmit (RFC5827) except RACK
is not constrained to only enter recovery for small outstanding
flights.
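The timeout computation above can be sketched as follows. This is a simplified model, not the kernel code: all times are in a single abstract unit, the struct and function names are hypothetical, and packets already past their deadline are assumed to be marked lost without needing a timer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REO_FUDGE 2	/* the 2-unit fudge factor described above */

struct pkt {
	uint64_t xmit_time;	/* when the packet was last sent */
};

/* Returns the single timeout to arm for all suspected packets:
 *   max over packets of
 *     RACK.RTT + RACK.reo_wnd - (NOW - Packet.xmit_time)
 * plus the fudge, or 0 if no packet needs a timer.
 */
static inline uint64_t rack_reo_timeout(const struct pkt *pkts, size_t n,
					uint64_t now, uint64_t rtt,
					uint64_t reo_wnd)
{
	uint64_t max_wait = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		uint64_t deadline, wait;

		assert(pkts[i].xmit_time <= now);
		deadline = pkts[i].xmit_time + rtt + reo_wnd;
		if (deadline <= now)
			continue;	/* already overdue: no wait needed */
		wait = deadline - now;
		if (wait > max_wait)
			max_wait = wait;
	}
	return max_wait ? max_wait + REO_FUDGE : 0;
}
```

Taking the maximum means one timer covers N packets sent at different times, at the cost of expiring the earlier ones slightly late.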
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Direct call of tcp_set_keepalive() function from protocol-agnostic
sock_setsockopt() function in net/core/sock.c violates network
layering. And the newly introduced protocol (SMC-R) will need its own
keepalive function. Therefore, add a "keepalive" function pointer
to "struct proto", and call it from sock_setsockopt() via this pointer.
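The layering fix can be illustrated with a small stand-alone model. The structures below are heavily simplified stand-ins for "struct proto" and "struct sock", and every demo_ name and the proto_ka_calls counter are hypothetical, added only to make the dispatch observable.

```c
#include <assert.h>
#include <stddef.h>

struct sock;

/* Simplified stand-in for "struct proto": per-protocol operations. */
struct proto {
	const char *name;
	void (*keepalive)(struct sock *sk, int valbool); /* may be NULL */
};

struct sock {
	const struct proto *prot;
	int keepopen;		/* generic SO_KEEPALIVE flag */
	int proto_ka_calls;	/* illustration only: counts hook calls */
};

/* Stand-in for a protocol-specific handler such as tcp_set_keepalive(). */
static void demo_tcp_keepalive(struct sock *sk, int valbool)
{
	(void)valbool;
	sk->proto_ka_calls++;
}

static const struct proto demo_tcp_prot = { "TCP", demo_tcp_keepalive };
static const struct proto demo_udp_prot = { "UDP", NULL };

/* Protocol-agnostic SO_KEEPALIVE handling: the generic code never
 * names a TCP function, it only follows the pointer if one is set.
 */
static inline void demo_sock_set_keepalive(struct sock *sk, int valbool)
{
	assert(sk && sk->prot);
	if (sk->prot->keepalive)
		sk->prot->keepalive(sk, valbool);
	sk->keepopen = !!valbool;
}
```

A protocol without keepalive support leaves the pointer NULL, and a new protocol only has to fill in its own handler.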
Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
Reviewed-by: Utz Bacher <utz.bacher@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
tsq_flags being in the same cache line as sk_wmem_alloc
makes a lot of sense. Both fields are changed from tcp_wfree()
and more generally by various TSQ related functions.
Prior patch made room in struct sock and added sk_tsq_flags,
this patch deletes tsq_flags from struct tcp_sock.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The current code changes txhash (flowlabels) on every retransmitted
SYN/ACK, but only after the 2nd retransmitted SYN and only after
tcp_retries1 RTO retransmits.
With this patch:
1) txhash is changed with every SYN retransmit
2) txhash is changed with every RTO.
The result is that we can start re-routing around failed (or very
congested) paths as soon as possible. Otherwise application health
checks may fail and the connection may be terminated before we start
to change txhash.
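A toy model of the new behavior, assuming a per-connection hash that is re-rolled on every retransmission timeout. The names are hypothetical, rand() is only a placeholder random source, and forcing the new value to differ from the old one is a choice made for this sketch rather than something the patch states.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct conn {
	uint32_t txhash;	/* 0 is reserved to mean "not set" */
};

/* Pick a fresh non-zero hash so later packets may take another path. */
static inline void demo_rethink_txhash(struct conn *c)
{
	uint32_t h;

	assert(c);
	do {
		h = (uint32_t)rand();	/* placeholder random source */
	} while (h == 0 || h == c->txhash);
	c->txhash = h;
}

/* With the patch, every RTO (including SYN retransmits) re-rolls
 * the hash, instead of waiting for a retry threshold.
 */
static inline void demo_on_rto(struct conn *c)
{
	demo_rethink_txhash(c);
}
```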
v4: Removed sysctl, txhash is changed for all RTOs
v3: Removed text saying default value of sysctl is 0 (it is 100)
v2: Added sysctl documentation and cleaned code
Tested with packetdrill tests
Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the TFO socket is accepted right off SYN-data, the socket
owner can call getsockopt(TCP_INFO) to collect ongoing SYN-ACK
retransmission or timeout stats (i.e., tcpi_total_retrans,
tcpi_retransmits). Currently those stats are only updated
when the handshake completes. This patch fixes it.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This adds kernel-doc style descriptions for 6 functions and
fixes 1 typo.
Signed-off-by: Richard Sailer <richard@weltraumpflege.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We want to make the TCP stack preemptible, as draining prequeue
and backlog queues can take a lot of time.
Many SNMP updates were assuming that BH (and preemption) was disabled.
Need to convert some __NET_INC_STATS() calls to NET_INC_STATS()
and some __TCP_INC_STATS() to TCP_INC_STATS()
Before using this_cpu_ptr(net->ipv4.tcp_sk) in tcp_v4_send_reset()
and tcp_v4_send_ack(), we add an explicit preempt disabled section.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rename NET_INC_STATS_BH() to __NET_INC_STATS()
and NET_ADD_STATS_BH() to __NET_ADD_STATS()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linux TCP stack painfully segments all TSO/GSO packets before retransmits.
This was fine back in the days when TSO/GSO were emerging, with their
bugs, but we believe the dark age is over.
Keeping big packets in write queues, but also in stack traversal
has a lot of benefits.
- Less memory overhead, because write queues have less skbs
- Less cpu overhead at ACK processing.
- Better SACK processing, as a lot of studies mentioned how
awful Linux was at this ;)
- Less cpu overhead to send the rtx packets
(IP stack traversal, netfilter traversal, drivers...)
- Better latencies in presence of losses.
- Smaller spikes in fq like packet schedulers, as retransmits
are not constrained by TCP Small Queues.
1% packet loss is common today, and at 100Gbit speeds, this
translates to ~80,000 losses per second.
Losses are often correlated, and we see many retransmit events
leading to 1-MSS train of packets, at the time hosts are already
under stress.
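The loss-rate figure quoted above can be checked with a short calculation. The packet size is an assumption made for this sketch (1500 bytes in the usage note), not a number stated in the text, and the helper name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Expected packet losses per second on a saturated link.
 * pkt_bytes is an assumed average packet size, loss_pct a whole
 * percentage between 0 and 100.
 */
static inline uint64_t losses_per_sec(uint64_t link_bps, uint32_t pkt_bytes,
				      uint32_t loss_pct)
{
	uint64_t pkts_per_sec;

	assert(pkt_bytes > 0 && loss_pct <= 100);
	pkts_per_sec = link_bps / (8ULL * pkt_bytes);
	return pkts_per_sec * loss_pct / 100;
}
```

With link_bps = 100000000000, pkt_bytes = 1500 and loss_pct = 1, this gives 83333 losses per second, consistent with the ~80,000 estimate.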
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the final part required to namespaceify the tcp
keep alive mechanism.
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is required to have full tcp keepalive mechanism namespace
support.
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Different net namespaces might have different requirements as to
the keepalive time of TCP sockets. This might be required in cases
where different firewall rules are in place which require TCP
socket timeouts to be increased/decreased independently of the host.
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix incrementing TCPFastOpenActiveFailed snmp stats multiple times
when the handshake experiences multiple SYN timeouts.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some middle-boxes black-hole the data after the Fast Open handshake
(https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf).
The exact reason is unknown. The work-around is to disable Fast Open
temporarily after multiple recurring timeouts with few or no data
delivered in the established state.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The alive parameter of tcp_orphan_retries indicates
whether the connection is assumed alive or not.
In the function and in all places calling it, it is used as a
boolean value.
Therefore this changes the type of alive to bool in the function
definition and all calling locations.
Since tcp_orphan_retries is a tcp_timer.c local function, no change
in any other file or header is necessary.
Signed-off-by: Richard Sailer <richard@weltraumpflege.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
After commit 900f65d361d3 ("tcp: move duplicate code from
tcp_v4_init_sock()/tcp_v6_init_sock()"), we no longer
need to export tcp_init_xmit_timers()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce an optimized version of sk_under_memory_pressure()
for TCP. Our intent is to use it in fast paths.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Diagnosing problems related to Window Probes has been hard because
we lack a counter.
TCPWinProbe counts the number of ACK packets a sender has to send
at regular intervals to make sure a reverse ACK packet opening back
a window has not been lost.
TCPKeepAlive counts the number of ACK packets sent to keep TCP
flows alive (SO_KEEPALIVE)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fast Open has been using an experimental option with a magic number
(RFC6994). This patch makes the client by default use the RFC7413
option (34) to get and send Fast Open cookies. This patch makes
the client solicit cookies from a given server first with the
RFC7413 option. If that fails to elicit a cookie, then it tries
the RFC6994 experimental option. If that also fails, it uses the
RFC7413 option on all subsequent connect attempts. If the server
returns a Fast Open cookie then the client caches the form of the
option that successfully elicited a cookie, and uses that form on
later connects when it presents that cookie.
The idea is to gradually obsolete the use of experimental options as
the servers and clients upgrade, while keeping interoperability in
the meantime.
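The client-side fallback order described above can be modeled as a small state machine. This is a sketch under stated assumptions: the enum, struct and function names are hypothetical, and the per-server cache is reduced to three fields.

```c
#include <assert.h>

enum tfo_opt {
	TFO_OPT_RFC7413,	/* the assigned option (34) */
	TFO_OPT_EXP		/* RFC6994 experimental option */
};

/* Hypothetical per-server cache entry. */
struct tfo_cache {
	int have_cookie;
	enum tfo_opt cookie_form;	/* valid only if have_cookie */
	unsigned int failed_requests;	/* requests that got no cookie */
};

/* Which option form to use on the next connect to this server. */
static inline enum tfo_opt tfo_pick_option(const struct tfo_cache *c)
{
	assert(c);
	if (c->have_cookie)
		return c->cookie_form;	/* reuse the form that worked */
	/* First try RFC7413, then the experimental option once,
	 * then RFC7413 on all subsequent attempts.
	 */
	return c->failed_requests == 1 ? TFO_OPT_EXP : TFO_OPT_RFC7413;
}

/* Record the outcome of a cookie request made with option 'used'. */
static inline void tfo_request_done(struct tfo_cache *c, enum tfo_opt used,
				    int got_cookie)
{
	assert(c);
	if (got_cookie) {
		c->have_cookie = 1;
		c->cookie_form = used;
	} else if (c->failed_requests < 2) {
		c->failed_requests++;
	}
}
```

Capping failed_requests at 2 keeps the experimental option to a single attempt per server, which matches the order given above.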
Signed-off-by: Daniel Lee <Longinus00@gmail.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It is not needed, and req->sk_listener points to the listener anyway.
request_sock argument can be const.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>