51572 Commits

Author SHA1 Message Date
Eric Dumazet
78ad27409c
BACKPORT: tcp: minor optimization in tcp_add_backlog()
If the packet is going to be coalesced, the sk_sndbuf/sk_rcvbuf values
are not used. Defer their access to the point where we need them.
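
Roughly, the change looks like this (a sketch, not the exact 4.14 diff;
the local names here are illustrative):

	bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
	{
		u32 limit;
		...
		if (coalesced)		/* skb merged into the backlog tail */
			goto done;	/* sk_rcvbuf/sk_sndbuf never read */

		/* Only the non-coalesced path needs the limit: */
		limit = READ_ONCE(sk->sk_rcvbuf) + READ_ONCE(sk->sk_sndbuf);
		...
	}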

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[cyberknight777: backport to 4.14]
Signed-off-by: Cyber Knight <cyberknight755@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-26 14:54:13 +07:00
Yaroslav Furman
48371293ce
net: silence 'quantum of class 10010 is big' warnings
Signed-off-by: Yaroslav Furman <yaro330@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:52 +07:00
Martin KaFai Lau
2c462f032a
UPSTREAM: tcp: Rename bictcp function prefix to cubictcp
The cubic functions in tcp_cubic.c are using the bictcp prefix as
in tcp_bic.c.  This patch gives them the proper cubictcp name
because a later patch will allow bpf programs to directly
call the cubictcp implementation.  Renaming them avoids
a name collision when trying to find the intended
function to call during bpf prog load time.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210325015155.1545532-1-kafai@fb.com
(cherry picked from commit d22f6ad18709e93622b6115ec9a5e42ed96b5d82)
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:44 +07:00
Yejune Deng
615ca2f310
UPSTREAM: tcp_cubic: use memset and offsetof init
In bictcp_reset(), use memset and offsetof instead of = 0.
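
The pattern looks like this (a sketch; the marker field name follows
upstream tcp_cubic.c and may differ in this tree):

	static inline void bictcp_reset(struct bictcp *ca)
	{
		/* Zero all fields up to the marker field, replacing a
		 * long run of "ca->field = 0;" assignments.
		 */
		memset(ca, 0, offsetof(struct bictcp, unused));
		ca->found = 0;
	}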

Signed-off-by: Yejune Deng <yejune.deng@gmail.com>
Link: https://lore.kernel.org/r/1610597696-128610-1-git-send-email-yejune.deng@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
(cherry picked from commit f4d133d86af7f39a0f5bdaf7a888ec7b84733b5e)
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:44 +07:00
Eric Dumazet
c18e4e46f8
UPSTREAM: tcp_cubic: refactor code to perform a divide only when needed
Neal Cardwell suggested not changing ca->delay_min
and applying the ack delay cushion only while the Hystart ACK train
is still under consideration. This should avoid a 64-bit
divide unless needed.
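
Sketch of the reworked branch in hystart_update() (names per upstream;
this backport may differ slightly):

	if (hystart_detect & HYSTART_ACK_TRAIN) {
		/* ca->delay_min itself stays untouched; the 64-bit divide
		 * inside hystart_ack_delay() only runs on this path.
		 */
		u32 threshold = ca->delay_min + hystart_ack_delay(sk);
		...
	}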

Tested:

40Gbit(mlx4) testbed (with sch_fq as packet scheduler)

$ echo -n 'file tcp_cubic.c +p'  >/sys/kernel/debug/dynamic_debug/control
$ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
  14815
  16280
  15293
  15563
  11574
  15145
  14789
  18548
  16972
  12520
TcpExtTCPHystartTrainDetect     10                 0.0
TcpExtTCPHystartTrainCwnd       1396               0.0
$ dmesg | tail -10
[ 4873.951350] hystart_ack_train (116 > 93) delay_min 24 (+ ack_delay 69) cwnd 80
[ 4875.155379] hystart_ack_train (55 > 50) delay_min 21 (+ ack_delay 29) cwnd 160
[ 4876.333921] hystart_ack_train (69 > 62) delay_min 23 (+ ack_delay 39) cwnd 130
[ 4877.519037] hystart_ack_train (69 > 60) delay_min 22 (+ ack_delay 38) cwnd 130
[ 4878.701559] hystart_ack_train (87 > 63) delay_min 24 (+ ack_delay 39) cwnd 160
[ 4879.844597] hystart_ack_train (93 > 50) delay_min 21 (+ ack_delay 29) cwnd 216
[ 4880.956650] hystart_ack_train (74 > 67) delay_min 20 (+ ack_delay 47) cwnd 108
[ 4882.098500] hystart_ack_train (61 > 57) delay_min 23 (+ ack_delay 34) cwnd 130
[ 4883.262056] hystart_ack_train (72 > 67) delay_min 21 (+ ack_delay 46) cwnd 130
[ 4884.418760] hystart_ack_train (74 > 67) delay_min 29 (+ ack_delay 38) cwnd 152

10Gbit(bnx2x) testbed (with sch_fq as packet scheduler)

$ echo -n 'file tcp_cubic.c +p'  >/sys/kernel/debug/dynamic_debug/control
$ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpk52 -l -4000000; done;nstat|egrep "Hystart"
   7050
   7065
   7100
   6900
   7202
   7263
   7189
   6869
   7463
   7034
TcpExtTCPHystartTrainDetect     10                 0.0
TcpExtTCPHystartTrainCwnd       3199               0.0
$ dmesg | tail -10
[  176.920012] hystart_ack_train (161 > 141) delay_min 83 (+ ack_delay 58) cwnd 264
[  179.144645] hystart_ack_train (164 > 159) delay_min 120 (+ ack_delay 39) cwnd 444
[  181.354527] hystart_ack_train (214 > 168) delay_min 125 (+ ack_delay 43) cwnd 436
[  183.539565] hystart_ack_train (170 > 147) delay_min 96 (+ ack_delay 51) cwnd 326
[  185.727309] hystart_ack_train (177 > 160) delay_min 61 (+ ack_delay 99) cwnd 128
[  187.947142] hystart_ack_train (184 > 167) delay_min 123 (+ ack_delay 44) cwnd 367
[  190.166680] hystart_ack_train (230 > 153) delay_min 116 (+ ack_delay 37) cwnd 444
[  192.327285] hystart_ack_train (210 > 206) delay_min 86 (+ ack_delay 120) cwnd 152
[  194.511392] hystart_ack_train (173 > 151) delay_min 94 (+ ack_delay 57) cwnd 239
[  196.736023] hystart_ack_train (149 > 146) delay_min 105 (+ ack_delay 41) cwnd 399

Fixes: 42f3a8aaae66 ("tcp_cubic: tweak Hystart detection for short RTT flows")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Neal Cardwell <ncardwell@google.com>
Link: https://www.spinics.net/lists/netdev/msg621886.html
Link: https://www.spinics.net/lists/netdev/msg621797.html
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit f278b99ca6b2d91a5744588d81bae297179b0d1f)
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:43 +07:00
Eric Dumazet
7f9a499cba
UPSTREAM: tcp_cubic: make Hystart aware of pacing
For years we disabled Hystart ACK train detection at Google
because it was fooled by TCP pacing.

ACK train detection uses a simple heuristic, detecting whether
we receive an ACK past half the RTT, to exit slow start before
hitting the bottleneck and experiencing massive drops.

But pacing by design might delay packets up to RTT/2,
so we need to tweak the Hystart logic to be aware of this
extra delay.
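
Sketch of the tweak (per the upstream patch; this backport may differ):

	u32 threshold = ca->delay_min + hystart_ack_delay(sk);

	/* Hystart ack train triggers if we get ack past ca->delay_min/2.
	 * Pacing might have delayed packets up to RTT/2 during slow start.
	 */
	if (sk->sk_pacing_status == SK_PACING_NONE)
		threshold >>= 1;

	if ((s32)(now - ca->round_start) > threshold) {
		... /* ACK train detected: exit slow start */
	}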

Tested:
 Added a 100 usec delay at receiver.

Before:
nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
   9117
   7057
   9553
   8300
   7030
   6849
   9533
  10126
   6876
   8473
TcpExtTCPHystartTrainDetect     10                 0.0
TcpExtTCPHystartTrainCwnd       1230               0.0

After :
nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
   9845
  10103
  10866
  11096
  11936
  11487
  11773
  12188
  11066
  11894
TcpExtTCPHystartTrainDetect     10                 0.0
TcpExtTCPHystartTrainCwnd       6462               0.0

Disabling Hystart ACK Train detection gives similar numbers

echo 2 >/sys/module/tcp_cubic/parameters/hystart_detect
nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
  11173
  10954
  12455
  10627
  11578
  11583
  11222
  10880
  10665
  11366

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit ede656e8465839530c3287c7f54adf75dc2b9563)
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:43 +07:00
Eric Dumazet
1ce9880e7b
UPSTREAM: tcp_cubic: tweak Hystart detection for short RTT flows
After switching ca->delay_min to usec resolution, we exit
slow start prematurely for very low RTT flows, setting
snd_ssthresh to 20.

The reason is that delay_min is fed with RTTs of small packet
trains. Then, as cwnd is increased, TCP sends bigger TSO packets.

LRO/GRO aggregation and/or interrupt mitigation strategies
on the receiver tend to inflate RTT samples.

Fix this by adding to delay_min the expected delay of
two TSO packets, given current pacing rate.
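
Sketch of the added cushion (constants hedged; see the upstream commit
for the exact values):

	static u32 hystart_ack_delay(struct sock *sk)
	{
		unsigned long rate;

		rate = READ_ONCE(sk->sk_pacing_rate);
		if (!rate)
			return 0;
		/* transmit time of ~2 max-size TSO packets, in usec,
		 * capped at 1 ms
		 */
		return min_t(u64, USEC_PER_MSEC,
			     div64_ul(2 * (u64)GSO_MAX_SIZE * USEC_PER_SEC,
				      rate));
	}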

Tested:

Sender uses pfifo_fast qdisc

Before :
$ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
  11348
  11707
  11562
  11428
  11773
  11534
   9878
  11693
  10597
  10968
TcpExtTCPHystartTrainDetect     10                 0.0
TcpExtTCPHystartTrainCwnd       200                0.0

After :
$ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
  14877
  14517
  15797
  18466
  17376
  14833
  17558
  17933
  16039
  18059
TcpExtTCPHystartTrainDetect     10                 0.0
TcpExtTCPHystartTrainCwnd       1670               0.0

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 42f3a8aaae66d31d87850fb4b02979a0fc5dc541)
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:43 +07:00
Eric Dumazet
287f6994e9
UPSTREAM: tcp_cubic: switch bictcp_clock() to usec resolution
Current 1ms clock feeds ca->round_start, ca->delay_min,
ca->last_ack.

This is quite problematic for data-center flows, where delay_min
is way below 1 ms.

This means Hystart Train detection triggers every time the jiffies value
is updated, since the "((s32)(now - ca->round_start) > ca->delay_min >> 4)"
expression becomes true.

This kind of random behavior can be solved by reusing the existing
usec timestamp that TCP keeps in tp->tcp_mstamp.
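
Sketch of the resulting clock (per the upstream patch):

	static inline u32 bictcp_clock_us(const struct sock *sk)
	{
		return tcp_sk(sk)->tcp_mstamp;
	}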

Note that a followup patch will tweak things a bit, because
during slow start, GRO aggregation on receivers naturally
increases the RTT as TSO packets gradually come to ~64KB size.

To recap, right after this patch CUBIC Hystart train detection
is more aggressive, since short RTT flows might exit slow start at
cwnd = 20, instead of being possibly unbounded.

The following patch will address this problem.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit cff04e2da308c522f654237b45dd64248fe8d1fa)
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:43 +07:00
Eric Dumazet
67ff84a403
UPSTREAM: tcp_cubic: remove one conditional from hystart_update()
If we initialize ca->curr_rtt to ~0U, we do not need to test
for a zero value in hystart_update().

We only read ca->curr_rtt if at least HYSTART_MIN_SAMPLES have
been processed, and thus ca->curr_rtt will have a sane value.
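
The resulting pattern (sketch):

	ca->curr_rtt = ~0U;	/* at reset / start of each round */

	/* hystart_update() can then drop the zero test: */
	if (ca->curr_rtt > delay)	/* was: !ca->curr_rtt || ca->curr_rtt > delay */
		ca->curr_rtt = delay;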

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 35821fc2b41c51161e503bade3630603668dd3af)
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:43 +07:00
Eric Dumazet
c3199b75de
UPSTREAM: tcp_cubic: optimize hystart_update()
We do not care which bit in ca->found is set.

We avoid accessing hystart and hystart_detect unless really needed,
possibly avoiding one cache line miss.
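
Sketch of the reordered test in bictcp_acked() (per the upstream patch;
this backport may differ):

	/* ca->found sits in ca's cache line, which is hot already;
	 * the hystart module params live elsewhere and are now only
	 * read while ca->found is still clear.
	 */
	if (!ca->found && tcp_in_slow_start(tp) && hystart &&
	    tp->snd_cwnd >= hystart_low_window)
		hystart_update(sk, delay);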

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 473900a504e510cf9175876de8892ad1e3e7efab)
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:42 +07:00
Neal Cardwell
d39d810ccf
BACKPORT: FROMGIT: net-tcp_bbr: BBRv2 for Linux TCP
BBR v2 is an enhancement to the BBR v1 algorithm. It's designed to aim for lower
queues, lower loss, and better Reno/CUBIC coexistence than BBR v1.

BBR v2 maintains the core of BBR v1: an explicit model of the network path that
is two-dimensional, adapting to estimate the (a) maximum available bandwidth
and (b) maximum safe volume of data a flow can keep in-flight in the
network. It maintains the estimated BDP as a core guide for estimating an
appropriate level of in-flight data.

BBR v2 makes several key enhancements:

o Its bandwidth-probing time scale is adapted, within bounds, to allow improved
coexistence with Reno and CUBIC. The bandwidth-probing time scale is (a)
extended dynamically based on estimated BDP to improve coexistence with
Reno/CUBIC; (b) bounded by an interactive wall-clock time-scale to be more
scalable and responsive than Reno and CUBIC.

o Rather than being largely agnostic to loss and ECN marks, it explicitly uses
loss and (DCTCP-style) ECN signals to maintain its model.

o It aims for lower losses than v1 by adjusting its model to attempt to stay
within loss rate and ECN mark rate bounds (loss_thresh and ecn_thresh,
respectively).

o It adapts to loss/ECN signals even when the application is running out of
data ("application-limited"), in case the "application-limited" flow is also
"network-limited" (the bw and/or inflight available to this flow is lower than
previously estimated when the flow ran out of data).

o It has a three-part model: the model explicitly tracks three operating points,
where an operating point is a tuple: (bandwidth, inflight). The three operating
points are:

  o latest:        the latest measurement from the current round trip
  o upper bound:   robust, optimistic, long-term upper bound
  o lower bound:   robust, conservative, short-term lower bound

These are stored in the following state variables:

  o latest:  bw_latest, inflight_latest
  o lo:      bw_lo,     inflight_lo
  o hi:      bw_hi[2],  inflight_hi

To gain intuition about the meaning of the three operating points, it
may help to consider the analogs in CUBIC, which has a somewhat
analogous three-part model used by its probing state machine:

  BBR param     CUBIC param
  -----------   -------------
  latest     ~  cwnd
  lo         ~  ssthresh
  hi         ~  last_max_cwnd

The analogy is only a loose one, though, since the BBR operating
points are calculated differently, and are 2-dimensional (bw,inflight)
rather than CUBIC's one-dimensional notion of operating point
(inflight).

o It uses the three-part model to adapt the magnitude of its bandwidth
to match the estimated space available in the buffer, rather than (as
in BBR v1) assuming that it was always acceptable to place 0.25*BDP in
the bottleneck buffer when probing (commodity datacenter switches
commonly do not have that much buffer for WAN flows). When BBR v2
estimates it hit a buffer limit during probing, its bandwidth probing
then starts gently in case little space is still available in the
buffer, and then accelerates, slowly at first and then rapidly if it
can grow inflight without seeing congestion signals. In such cases,
probing is bounded by inflight_hi + inflight_probe, where
inflight_probe grows as: [0, 1, 2, 4, 8, 16,...]. This allows BBR to
keep losses low and bounded if a bottleneck remains congested, while
rapidly/scalably utilizing free bandwidth when it becomes available.

o It has a slightly revised state machine, to achieve the goals above.
    BBR_BW_PROBE_UP:    pushes up inflight to probe for bw/vol
    BBR_BW_PROBE_DOWN:  drain excess inflight from the queue
    BBR_BW_PROBE_CRUISE: use pipe, w/ headroom in queue/pipe
    BBR_BW_PROBE_REFILL: try to refill the pipe again to 100%, leaving queue empty

o The estimated BDP: BBR v2 continues to maintain an estimate of the
path's two-way propagation delay, by tracking a windowed min_rtt, and
coordinating (on an as-needed basis) to try to expose the two-way
propagation delay by draining the bottleneck queue.

BBR v2 continues to use its min_rtt and (currently-applicable) bandwidth
estimate to estimate the current bandwidth-delay product. The estimated BDP
still provides one important guideline for bounding inflight data. However,
because any min-filtered RTT and max-filtered bw inherently tend to both
overestimate, the estimated BDP is often too high; in this case loss or ECN
marks can ensue, in which case BBR v2 adjusts inflight_hi and inflight_lo to
adapt its sending rate and inflight down to match the available capacity of the
path.

o Space: Note that ICSK_CA_PRIV_SIZE increased. This is because BBR v2
adds 17 more u32 to struct bbr. However, there are 11 u32 fields from
BBR v1 that we can remove after switching to BBR v2:

        struct minmax bw;       /* Max recent delivery rate in pkts/uS << 24 */
        u32     rtt_cnt;            /* count of packet-timed rounds elapsed */
                ...
                packet_conservation:1,  /* use packet conservation? */
                ...
                lt_is_sampling:1,    /* taking long-term ("LT") samples now? */
                lt_rtt_cnt:7,        /* round trips in long-term interval */
                lt_use_bw:1;         /* use lt_bw as our bw estimate? */
        u32     lt_bw;               /* LT est delivery rate in pkts/uS << 24 */
        u32     lt_last_delivered;   /* LT intvl start: tp->delivered */
        u32     lt_last_stamp;       /* LT intvl start: tp->delivered_mstamp */
        u32     lt_last_lost;        /* LT intvl start: tp->lost */

  So ultimately BBR v2 uses 17-11 = 6 more u32 fields than v1.
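
  The growth is still bounded by the usual congestion-control
  private-area guard in tcp_bbr.c:

	BUILD_BUG_ON(sizeof(struct bbr) > ICSK_CA_PRIV_SIZE);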

o Code: BBR v2 reuses many pieces from BBR v1. But it omits the following
  significant pieces:

  o "packet conservation" (bbr_set_cwnd_to_recover_or_restore(),
    bbr_can_grow_inflight())
  o long-term bandwidth estimator ("policer mode")

  The code layout tries to keep BBR v2 code near the bottom of the
  file, so that v1-applicable code in the top does not accidentally
  refer to v2 code.

o Docs:
  See the following docs for more details and diagrams describing the BBR v2
  algorithm:
    https://datatracker.ietf.org/meeting/104/materials/slides-104-iccrg-an-update-on-bbr-00
    https://datatracker.ietf.org/meeting/102/materials/slides-102-iccrg-an-update-on-bbr-work-at-google-00

o Internal notes:
  For this upstream rebase, Neal started from:
    git show 6f6734c1c3c4:net/ipv4/tcp_bbr.c > net/ipv4/tcp_bbr.c
  then removed dev instrumentation (dynamic get/set for parameters)
    and code that was only used by BBR v1.

Effort: net-tcp_bbr
Origin-9xx-SHA1: 2c84098e60bed6d67dde23cd7538c51dee273102
Change-Id: I125cf26ba2a7a686f2fa5e87f4c2afceb65f7a05
(cherry-picked from 90e22aa359)
[kdrag0n: Backported to k4.14 by removing reord_seen from bbr_debug's
          output as it's not mandatory and 4.14 doesn't have it]
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:42 +07:00
Eric Dumazet
8b97ce57c0
BACKPORT: tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh
In the EDT design, I made the mistake of using tcp_wstamp_ns
to store the last tcp_clock_ns() sample and to store the
pacing virtual timer.

This causes major regressions at high speed flows.

Introduce tcp_clock_cache to store last tcp_clock_ns().
This is needed because some arches have a slow high-resolution
kernel time service.

tcp_wstamp_ns is only updated when a packet is sent.

Note that we can remove tcp_mstamp in the future since
tcp_mstamp is essentially tcp_clock_cache/1000, so the
apparent socket size increase is temporary.
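
Sketch of the resulting refresh (per the upstream patch):

	void tcp_mstamp_refresh(struct tcp_sock *tp)
	{
		u64 val = tcp_clock_ns();

		if (val > tp->tcp_clock_cache)
			tp->tcp_clock_cache = val;

		/* tcp_wstamp_ns is deliberately not touched here; it only
		 * advances when a packet is actually sent.
		 */
		val = div_u64(val, NSEC_PER_USEC);
		if (val > tp->tcp_mstamp)
			tp->tcp_mstamp = val;
	}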

Fixes: 9799ccb0e984 ("tcp: add tcp_wstamp_ns socket field")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:42 +07:00
Eric Dumazet
fdba84020d
UPSTREAM: tcp: add tcp_wstamp_ns socket field
TCP will soon provide earliest departure time on TX skbs.
It needs to track this in a new variable.

tcp_mstamp_refresh() needs to update this variable, and
became too big to stay an inline.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:42 +07:00
Yuchung Cheng
c115d206bd
UPSTREAM: tcp: refactor DCTCP ECN ACK handling
DCTCP has two parts - a new ECN signalling mechanism and the response
function to it. The first part can be used by other congestion
control for DCTCP-ECN deployed networks. This patch moves that part
into a separate tcp_dctcp.h to be used by other congestion control
module (like how Yeah uses Vegas algorithmas). For example, BBR is
experimenting such ECN signal currently
https://tinyurl.com/ietf-102-iccrg-bbr2

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:42 +07:00
Yuchung Cheng
a63f756221
UPSTREAM: tcp: avoid resetting ACK timer in DCTCP
The recent fix of acking immediately in DCTCP on CE status change
has an undesirable side-effect: it also resets the TCP ack timer and
disables pingpong mode (interactive session). But the CE status
change has nothing to do with them. This patch addresses that by
using the new one-time immediate ACK flag instead of calling
tcp_enter_quickack_mode().

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:41 +07:00
Yuchung Cheng
ffc3157b48
BACKPORT: tcp: mandate a one-time immediate ACK
Add a new flag to indicate a one-time immediate ACK. This flag is
occasionally set under specific TCP protocol states in addition to
the more common quickack mechanism for interactive applications.

In several cases in the TCP code we want to force an immediate ACK
but do not want to call tcp_enter_quickack_mode() because we do
not want to forget the icsk_ack.pingpong or icsk_ack.ato state.
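
Sketch of the new flag and its use (flag value per upstream; callers
just OR it in without touching the quickack state):

	enum inet_csk_ack_state_t {
		ICSK_ACK_SCHED	= 1,
		ICSK_ACK_TIMER	= 2,
		ICSK_ACK_PUSHED	= 4,
		ICSK_ACK_PUSHED2 = 8,
		ICSK_ACK_NOW	= 16	/* send the next ACK immediately (once) */
	};

	/* e.g. in DCTCP's CE state handling: */
	inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_NOW;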

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:41 +07:00
Yousuk Seung
df2e152ce8
FROMGIT: net-tcp: add new ca opts flag TCP_CONG_WANTS_CE_EVENTS
Add a new ca opts flag, TCP_CONG_WANTS_CE_EVENTS, that allows a
congestion control module to receive CE events.

Currently congestion control modules have to set the TCP_CONG_NEEDS_ECN
bit in the opts flag to receive CE events, but this may incur changes in ECN
behavior elsewhere. This patch adds a new bit TCP_CONG_WANTS_CE_EVENTS
that allows congestion control modules to receive CE events
independently of TCP_CONG_NEEDS_ECN.
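
Sketch of the flag and its check (the bit value here is hypothetical
for this tree):

	#define TCP_CONG_WANTS_CE_EVENTS	0x4

	static inline bool tcp_ca_wants_ce_events(const struct sock *sk)
	{
		const struct inet_connection_sock *icsk = inet_csk(sk);

		return icsk->icsk_ca_ops->flags & (TCP_CONG_NEEDS_ECN |
						   TCP_CONG_WANTS_CE_EVENTS);
	}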

Effort: net-tcp
Origin-9xx-SHA1: 9f7e14716cde760bc6c67ef8ef7e1ee48501d95b
Change-Id: I2255506985242f376d910c6fd37daabaf4744f24
(cherry-picked from 890c9fb3e2)
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:41 +07:00
Neal Cardwell
210f86cec5
BACKPORT: FROMGIT: net-tcp_bbr: v2: set tx.in_flight for skbs in repair write queue
Syzkaller was able to use TCP_REPAIR to reproduce the new warning
added in tcp_fragment():

  WARNING: CPU: 0 PID: 118174 at net/ipv4/tcp_output.c:1487
    tcp_fragment+0xdcc/0x10a0 net/ipv4/tcp_output.c:1487()
  inconsistent: tx.in_flight: 0 old_factor: 53

The warning happens because skbs inserted into the tcp_rtx_queue
during the repair process go through a sort of "fake send" process,
and that process was setting pcount but not tx.in_flight, hence the
warnings (where old_factor is the old pcount).

The fix of setting tx.in_flight in the TCP_REPAIR code path seems
simple enough, and indeed makes the repro code from syzkaller stop
producing warnings. Running through kokonut tests, and will send out
for review when all tests pass.

Effort: net-tcp_bbr
Origin-9xx-SHA1: 330f825a08a6fe92cef74d799cc468864c479f63
Change-Id: I0bc4a790f040fd4239620e1eedd5dc64666c6f05
(cherry-picked from 3d84624749)
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:41 +07:00
Neal Cardwell
c72e3b4a7d
BACKPORT: FROMGIT: net-tcp_bbr: v2: adjust skb tx.in_flight upon split in tcp_fragment()
When we fragment an skb that has already been sent, we need to update
the tx.in_flight for the first skb in the resulting pair ("buff").

Because we were not updating the tx.in_flight, the tx.in_flight value
was inconsistent with the pcount of the "buff" skb (tx.in_flight would
be too high). That meant that if the "buff" skb was lost, then
bbr2_inflight_hi_from_lost_skb() would calculate an inflight_hi value
that is too high. This could result in longer queues and higher packet
loss.

Packetdrill testing verified that without this commit, when the second
half of an skb is SACKed and then later the first half of that skb is
marked lost, the calculated inflight_hi was incorrect.

Effort: net-tcp_bbr
Origin-9xx-SHA1: 385f1ddc610798fab2837f9f372857438b25f874
Change-Id: I617f8cab4e9be7a0b8e8d30b047bf8645393354d
(cherry-picked from 6a8839d2a6)
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:41 +07:00
Neal Cardwell
54ffd91b3f
FROMGIT: net-tcp_bbr: v2: adjust skb tx.in_flight upon merge in tcp_shifted_skb()
When tcp_shifted_skb() updates state as adjacent SACKed skbs are
coalesced, previously the tx.in_flight was not adjusted, so we could
get contradictory state where the skb's recorded pcount was bigger
than the tx.in_flight (the number of segments that were in_flight
after sending the skb).

Normally, having a SACKed skb with contradictory pcount/tx.in_flight
would not matter. However, with SACK reneging, the SACKed bit is
removed, and an skb once again becomes eligible for retransmitting,
fragmenting, SACKing, etc. Packetdrill testing verified the following
sequence is possible in a kernel that does not have this commit:

 - skb N is SACKed
 - skb N+1 is SACKed and combined with skb N using tcp_shifted_skb()
   - tcp_shifted_skb() will increase the pcount of prev,
     but leave tx.in_flight as-is
   - so prev skb can have pcount > tx.in_flight
 - RTO, tcp_timeout_mark_lost(), detect reneg,
   remove "SACKed" bit, mark skb N as lost
   - find pcount of skb N is greater than its tx.in_flight

I suspect this issue is what caused the bbr2_inflight_hi_from_lost_skb():
  WARN_ON_ONCE(inflight_prev < 0)
to fire in production machines using bbr2.

Tested: See last commit in series for sponge link.

Effort: net-tcp_bbr
Origin-9xx-SHA1: 1a3e997e613d2dcf32b947992882854ebe873715
Change-Id: I1b0b75c27519953430c7db51c6f358f104c7af55
(cherry-picked from 0672a0d858)
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:40 +07:00
Neal Cardwell
ec675cc5fa
FROMGIT: net-tcp_bbr: v2: factor out tx.in_flight setting into tcp_set_tx_in_flight()
Factor out the code to set an skb's tx.in_flight field into its own
function, so that this code can be used for the TCP_REPAIR "fake send"
code path that inserts skbs into the rtx queue without sending
them. This is in preparation for the following patch, which fixes an
issue with TCP_REPAIR and tx.in_flight.
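
Sketch of the factored-out helper (simplified; the real one also
sanity-checks the value):

	void tcp_set_tx_in_flight(struct sock *sk, struct sk_buff *skb)
	{
		struct tcp_sock *tp = tcp_sk(sk);

		/* Record packets in flight just after this skb is sent. */
		TCP_SKB_CB(skb)->tx.in_flight = tcp_packets_in_flight(tp) +
						tcp_skb_pcount(skb);
	}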

Tested: See last patch in series for sponge link.

Effort: net-tcp_bbr
Origin-9xx-SHA1: e880fc907d06ea7354333f60f712748ebce9497b
Change-Id: I4fbd4a6e18a51ab06d50ab1c9ad820ce5bea89af
(cherry-picked from f14b4e2b1c)
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:40 +07:00
Neal Cardwell
0d4961e2c8
FROMGIT: net-tcp_bbr: v2: introduce ca_ops->skb_marked_lost() CC module callback API
For connections experiencing reordering, RACK can mark packets lost
long after we receive the SACKs/ACKs hinting that the packets were
actually lost.

This means that CC modules cannot easily learn the volume of inflight
data at which packet loss happens by looking at the current inflight
or even the packets in flight when the most recently SACKed packet was
sent. To learn this, CC modules need to know how many packets were in
flight at the time lost packets were sent. This new callback, combined
with TCP_SKB_CB(skb)->tx.in_flight, allows them to learn this.

This also provides a consistent callback that is invoked whether
packets are marked lost upon ACK processing, using the RACK reordering
timer, or at RTO time.

Effort: net-tcp_bbr
Origin-9xx-SHA1: afcbebe3374e4632ac6714d39e4dc8a8455956f4
Change-Id: I54826ab53df636be537e5d3c618a46145d12d51a
(cherry-picked from c63c9581d6)
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:40 +07:00
Neal Cardwell
2bc596b50b
FROMGIT: net-tcp_bbr: v2: export FLAG_ECE in rate_sample.is_ece
For understanding the relationship between inflight and ECN signals,
to try to find the highest inflight value that has acceptable levels
of ECN marking.

Effort: net-tcp_bbr
Origin-9xx-SHA1: 3eba998f2898541406c2666781182200934965a8
Change-Id: I3a964e04cee83e11649a54507043d2dfe769a3b3
(cherry-picked from d51b05c0f0)
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:39 +07:00
Neal Cardwell
93a9e043b6
FROMGIT: net-tcp_bbr: v2: count packets lost over TCP rate sampling interval
For understanding the relationship between inflight and packet loss
signals, to try to find the highest inflight value that has acceptable
levels of packet losses.

Effort: net-tcp_bbr
Origin-9xx-SHA1: 4527e26b2bd7756a88b5b9ef1ada3da33dd609ab
Change-Id: I594c2500868d9c530770e7ddd68ffc87c57f4fd5
(cherry-picked from ebe09b73d0)
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:39 +07:00
Neal Cardwell
79ff476667
FROMGIT: net-tcp_bbr: v2: snapshot packets in flight at transmit time and pass in rate_sample
For understanding the relationship between inflight and losses or ECN
signals, to try to find the highest inflight value that has acceptable
levels of loss/ECN marking.

Effort: net-tcp_bbr
Origin-9xx-SHA1: b3eb4f2d20efab4ca001f32c9294739036c493ea
Change-Id: I7314047d0ff14dd261a04b1969a46dc658c8836a
(cherry-picked from 1379c86f3f)
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:39 +07:00
Yuchung Cheng
88596c7558
FROMGIT: net-tcp_rate: account for CE marks in rate sample
This patch counts the number of delivered packets that have the CE
mark in the rate sample, using an approach similar to delivery accounting.
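
Sketch of the accounting (field names hedged):

	/* at transmit time, in the skb's meta header: */
	TCP_SKB_CB(skb)->tx.delivered_ce = tp->delivered_ce;

	/* at ACK time, in the rate sample: */
	rs->delivered_ce = tp->delivered_ce - rs->prior_delivered_ce;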

Effort: net-tcp_rate
Origin-9xx-SHA1: 710644db434c3da335a7c8b72207a671ccbb5cf8
Change-Id: I0968fb33fe19b5c774e8c3afd2685558a6ec8710
(cherry-picked from 0180077856)
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:39 +07:00
Yuchung Cheng
c7402958d9
BACKPORT: tcp: track total bytes delivered with ECN CE marks
Introduce a new delivered_ce stat in the tcp socket to estimate the
number of packets being marked with CE bits. The estimation is
done via ACKs with the ECE bit. Depending on the actual receiver
behavior, the estimation could have biases.

Since the TCP sender can't really see the CE bit in the data path,
the sender is technically counting packets marked delivered with
the "ECE / ECN-Echo" flag set.

With RFC3168 ECN, because the ECE bit is sticky, this count can
drastically overestimate the number of CE-marked data packets.

With DCTCP-style ECN this should be reasonably precise unless there
is loss in the ACK path, in which case it's not precise.

With the AccECN proposal this can be made still more precise, even in
the case of some degree of ACK loss.

However, this is the sender's best estimate of the CE information.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:39 +07:00
Yuchung Cheng
90ee8d65e9
BACKPORT: tcp: new helper to calculate newly delivered
Add new helper tcp_newly_delivered() to prepare the ECN accounting change.
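
Sketch of the helper's shape (the flag argument is consumed by the
follow-up ECN accounting patch):

	/* Returns the number of packets newly acked or sacked by this ACK */
	static u32 tcp_newly_delivered(struct sock *sk, u32 prior_delivered,
				       int flag)
	{
		return tcp_sk(sk)->delivered - prior_delivered;
	}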

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:38 +07:00
Wei Wang
b090762b3e
[BACKPORT]tcp: change pingpong threshold to 3
In order to be more confident about an on-going interactive session, we
increment the pingpong count by 1 for every interactive transaction and
adjust TCP_PINGPONG_THRESH to 3.
This means we only consider a session to be in pingpong mode after we see 3
interactive transactions, and only then start to activate delayed acks in
quick ack mode.
And in order to not over-count the credits, we only increase the pingpong
count for the first packet sent in response to the previously received
packet.
This is mainly to prevent delaying the ack immediately after some
handshake protocol but no real interactive traffic pattern afterwards.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: UtsavisGreat <utsavbalar1231@gmail.com>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
Signed-off-by: Panchajanya Sarkar <panchajanya@azure-dev.live>
Signed-off-by: Forenche <prahul2003@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:38 +07:00
Wei Wang
c914b5be2d
[BACKPORT]tcp: Refactor pingpong code
Instead of using pingpong as a single bit of information, we refactor the
code to treat it as a counter. When an interactive session is detected,
we set the pingpong count to TCP_PINGPONG_THRESH. And when the pingpong count
is >= TCP_PINGPONG_THRESH, we consider the session to be in pingpong mode.

This patch is a pure refactor and sets foundation for the next patch.
This patch itself does not change any pingpong logic.
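
Sketch of the counter treatment (per the upstream helpers):

	#define TCP_PINGPONG_THRESH	1	/* raised to 3 by the next patch */

	static inline void tcp_enter_pingpong_mode(struct sock *sk)
	{
		inet_csk(sk)->icsk_ack.pingpong = TCP_PINGPONG_THRESH;
	}

	static inline bool tcp_in_pingpong_mode(struct sock *sk)
	{
		return inet_csk(sk)->icsk_ack.pingpong >= TCP_PINGPONG_THRESH;
	}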

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: UtsavisGreat <utsavbalar1231@gmail.com>
Signed-off-by: Panchajanya1999 <panchajanya@azure-dev.live>
Signed-off-by: Panchajanya Sarkar <panchajanya@azure-dev.live>
Signed-off-by: Forenche <prahul2003@gmail.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:38 +07:00
Kevin(Yudong) Yang
e0c51003f3
FROMLIST: tcp_bbr: fix quantization code to not raise cwnd if not probing bandwidth
There was a bug in the previous logic that attempted to ensure gain cycling
gets inflight above BDP even for small BDPs. This code correctly raised and
lowered target inflight values during the gain cycle. And this code
correctly ensured that cwnd was raised when probing bandwidth. However, it
did not correspondingly ensure that cwnd was *not* raised in this way when
*not* probing for bandwidth. The result was that small-BDP flows that were
always cwnd-bound could go for many cycles with a fixed cwnd, and not probe
or yield bandwidth at all. This meant that multiple small-BDP flows could
fail to converge in their bandwidth allocations.
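
The essence of the fix, in bbr_quantization_budget() (sketch; the old
condition keyed off the gain rather than the cycle phase):

	/* Ensure gain cycling gets inflight above BDP even for small BDPs. */
	if (bbr->mode == BBR_PROBE_BW && bbr->cycle_idx == 0)
		cwnd += 2;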

Fixes: 3c346b233c68 ("tcp_bbr: fix bw probing to raise in-flight data for very small BDPs")
Signed-off-by: Kevin(Yudong) Yang <yyd@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:38 +07:00
Luke Hsiao
b18b10557d
tcp_bbr: clarify that bbr_bdp() rounds up in comments
This explicitly clarifies, in the comments, that bbr_bdp() returns the
rounded-up value of the bandwidth-delay product, and why.

Signed-off-by: Luke Hsiao <lukehsiao@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:37 +07:00
Neal Cardwell
c8cb80e5b7
BACKPORT: FROMGIT: net-tcp_bbr: v2: shrink delivered_mstamp, first_tx_mstamp to u32 to free up 8 bytes
Free up some space for tracking inflight and losses for each
bw sample, in upcoming commits.

These timestamps are in microseconds, and are now stored in 32
bits. So they can only hold time intervals up to roughly 2^12 = 4096
seconds.  But Linux TCP RTT and RTO tracking has the same 32-bit
microsecond implementation approach and resulting deployment
limitations. So this is not introducing a new limit. And these should
not be a limitation for the foreseeable future.

Effort: net-tcp_bbr
Origin-9xx-SHA1: 238a7e6b5d51625fef1ce7769826a7b21b02ae55
Change-Id: I3b779603797263b52a61ad57c565eb91fe42680c
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:37 +07:00
Yuchung Cheng
c0eb978e51
BACKPORT: FROMGIT: net-tcp_rate: consolidate inflight tracking approaches in TCP
In order to track CE marks per rate sample (one round trip), we'll
need to snapshot the starting tcp delivered_ce count in the packet
meta header (tcp_skb_cb). But there's not enough space.

The good news is that "last_in_flight" in the header, used by
the NV congestion control, is almost equivalent to "delivered". In
fact "delivered" is better, as it additionally accounts for
out-of-order packets.  Therefore we can remove it to make room for the
CE tracking.

This would make delayed ACK detection slightly less accurate but the
impact is negligible since it's not used for any critical control.

Effort: net-tcp_rate
Origin-9xx-SHA1: ddcd46ec85d5f1c4454258af0c54b3254c0d64a7
Change-Id: I1a184aad6d101c981ac7f2f275aa9417ff856910
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:37 +07:00
Neal Cardwell
5a2182dd7c
FROMGIT: net-tcp_bbr: broaden app-limited rate sample detection
This commit is a bug fix for the Linux TCP app-limited logic
used for collecting rate (bandwidth) samples.

Previously the app-limited logic only looked for "bubbles" of
silence in between application writes, by checking at the start
of each sendmsg. But "bubbles" of silence can also happen before
retransmits: e.g. bubbles can happen between an application write
and a retransmit, or between two retransmits.

Retransmits are triggered by ACKs or timers. So this commit checks
for bubbles of app-limited silence upon ACKs or timers.

Why does this commit check for app-limited state at the start of
ACKs and timer handling? Because at that point we know whether
inflight was fully using the cwnd.  While processing the ACK or
timer event we often change the cwnd; after changing the cwnd we
can't know whether inflight was fully using the old cwnd.

Origin-9xx-SHA1: 3fe9b53291e018407780fb8c356adb5666722cbc
Change-Id: I37221506f5166877c2b110753d39bb0757985e68
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:37 +07:00
Priyaranjan Jha
74e6d12d22
tcp_bbr: adapt cwnd based on ack aggregation estimation
Aggregation effects are extremely common with wifi, cellular, and cable
modem link technologies, ACK decimation in middleboxes, and LRO and GRO
in receiving hosts. The aggregation can happen in either direction,
data or ACKs, but in either case the aggregation effect is visible
to the sender in the ACK stream.

Previously BBR's sending was often limited by cwnd under severe ACK
aggregation/decimation because BBR sized the cwnd at 2*BDP. If packets
were acked in bursts after long delays (e.g. one ACK acking 5*BDP after
5*RTT), BBR's sending was halted after sending 2*BDP over 2*RTT, leaving
the bottleneck idle for potentially long periods. Note that loss-based
congestion control does not have this issue because when facing
aggregation it continues increasing cwnd after bursts of ACKs, growing
cwnd until the buffer is full.

To achieve good throughput in the presence of aggregation effects, this
algorithm allows the BBR sender to put extra data in flight to keep the
bottleneck utilized during silences in the ACK stream that it has evidence
to suggest were caused by aggregation.

A summary of the algorithm: when a burst of packets is acked by a
stretched ACK or a burst of ACKs or both, BBR first estimates the expected
amount of data that should have been acked, based on its estimated
bandwidth. Then the surplus ("extra_acked") is recorded in a windowed-max
filter to estimate the recent level of observed ACK aggregation. Then cwnd
is increased by the ACK aggregation estimate. The larger cwnd avoids BBR
being cwnd-limited in the face of ACK silences that recent history suggests
were caused by aggregation. As a sanity check, the ACK aggregation degree
is upper-bounded by the cwnd (at the time of measurement) and a global max
of BW * 100ms. The algorithm is further described by the following
presentation:
https://datatracker.ietf.org/meeting/101/materials/slides-101-iccrg-an-update-on-bbr-work-at-google-00
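
A condensed sketch of the estimator (names per upstream tcp_bbr.c):

	/* How much data should the estimated bw have delivered this epoch? */
	epoch_us = tcp_stamp_us_delta(tp->delivered_mstamp,
				      bbr->ack_epoch_mstamp);
	expected_acked = ((u64)bbr_bw(sk) * epoch_us) / BW_UNIT;

	/* Surplus beyond the expectation feeds a windowed-max filter,
	 * capped by cwnd; cwnd is then provisioned as roughly
	 * 2*BDP + max(extra_acked).
	 */
	extra_acked = bbr->ack_epoch_acked - expected_acked;
	extra_acked = min(extra_acked, tp->snd_cwnd);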

In our internal testing, we observed a significant increase in BBR
throughput (measured using netperf), in a basic wifi setup.
- Host1 (sender on ethernet) -> AP -> Host2 (receiver on wifi)
- 2.4 GHz -> BBR before: ~73 Mbps; BBR after: ~102 Mbps; CUBIC: ~100 Mbps
- 5.0 GHz -> BBR before: ~362 Mbps; BBR after: ~593 Mbps; CUBIC: ~601 Mbps

Also, this code is running globally on YouTube TCP connections and produced
significant bandwidth increases for YouTube traffic.

This is based on Ian Swett's max_ack_height_ algorithm from the
QUIC BBR implementation.

Signed-off-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:37 +07:00
Priyaranjan Jha
3bccdbf1e9
tcp_bbr: refactor bbr_target_cwnd() for general inflight provisioning
Because bbr_target_cwnd() is really a general-purpose BBR helper for
computing some volume of inflight data as a function of the estimated
BDP, refactor it into the following helper functions:
- bbr_bdp()
- bbr_quantization_budget()
- bbr_inflight()
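
They compose as follows (sketch per the upstream refactor):

	static u32 bbr_inflight(struct sock *sk, u32 bw, int gain)
	{
		u32 inflight;

		inflight = bbr_bdp(sk, bw, gain);
		inflight = bbr_quantization_budget(sk, inflight, gain);

		return inflight;
	}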

Signed-off-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:36 +07:00
Neal Cardwell
4c8437cdf6
tcp_bbr: centralize code to set gains
Centralize the code that sets gains used for computing cwnd and pacing
rate. This simplifies the code and makes it easier to change the state
machine or (in the future) dynamically change the gain values and
ensure that the correct gain values are always used.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:36 +07:00
Kevin Yang
07df0f4dae
tcp_bbr: apply PROBE_RTT cwnd cap even if acked==0
This commit fixes a corner case where TCP BBR would enter PROBE_RTT
mode but not reduce its cwnd. If a TCP receiver ACKed less than one
full segment, the number of delivered/acked packets was 0, so that
bbr_set_cwnd() would short-circuit and exit early, without cutting
cwnd to the value we want for PROBE_RTT.

The fix is to instead make sure that even when 0 full packets are
ACKed, we do apply all the appropriate caps, including the cap that
applies in PROBE_RTT mode.
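
Sketch of the fix in bbr_set_cwnd() (per the upstream patch):

	if (!acked)
		goto done;	/* no packet fully ACKed; just apply caps */
	...
done:
	tp->snd_cwnd = min(cwnd, tp->snd_cwnd_clamp);	/* global cap */
	if (bbr->mode == BBR_PROBE_RTT)	/* drain queue, refresh min_rtt */
		tp->snd_cwnd = min(tp->snd_cwnd, bbr_cwnd_min_target);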

Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Kevin Yang <yyd@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:36 +07:00
Kevin Yang
2c9e2d7790
tcp_bbr: in restart from idle, see if we should exit PROBE_RTT
This patch fixes the case where BBR does not exit PROBE_RTT mode when
it restarts from idle. When BBR restarts from idle while in
PROBE_RTT mode, it should check if it's time to exit PROBE_RTT. If
yes, then BBR should exit PROBE_RTT mode and restore the cwnd to its
full value.

Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Kevin Yang <yyd@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:36 +07:00
Kevin Yang
a7c0031278
tcp_bbr: add bbr_check_probe_rtt_done() helper
This patch adds a helper function bbr_check_probe_rtt_done() to
  1. check the condition to see if bbr should exit probe_rtt mode;
  2. process the logic of exiting probe_rtt mode.
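
Sketch of the helper (per the upstream patch):

	static void bbr_check_probe_rtt_done(struct sock *sk)
	{
		struct tcp_sock *tp = tcp_sk(sk);
		struct bbr *bbr = inet_csk_ca(sk);

		if (!(bbr->probe_rtt_done_stamp &&
		      after(tcp_jiffies32, bbr->probe_rtt_done_stamp)))
			return;		/* not time to exit PROBE_RTT yet */

		bbr->min_rtt_stamp = tcp_jiffies32; /* wait a while until PROBE_RTT */
		tp->snd_cwnd = max(tp->snd_cwnd, bbr->prior_cwnd);
		bbr_reset_mode(sk);
	}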

Fixes: 0f8782ea1497 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Kevin Yang <yyd@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:36 +07:00
Eric Dumazet
9610772a33
tcp_bbr: fix bbr pacing rate for internal pacing
This commit makes BBR use only the MSS (without any headers) to
calculate pacing rates when internal TCP-layer pacing is used.

This is necessary to achieve the correct pacing behavior in this case,
since tcp_internal_pacing() uses only the payload length to calculate
pacing delays.
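
Sketch of the change in BBR's rate computation (helper name per the
upstream fix; the 4.14 tree may differ):

	unsigned int mss = tcp_sk(sk)->mss_cache;

	/* Only count headers when a lower layer does the pacing: */
	if (!tcp_needs_internal_pacing(sk))
		mss = tcp_mss_to_mtu(sk, mss);
	rate *= mss;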

Signed-off-by: Kevin Yang <yyd@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:35 +07:00
Yousuk Seung
392ce07c9e
net-tcp_bbr: set tp->snd_ssthresh to BDP upon STARTUP exit
Set tp->snd_ssthresh to the BDP upon STARTUP exit. This allows us
to check, via SCM_TIMESTAMPING_OPT_STATS, whether a BBR flow exited
STARTUP and what the BDP was at the time of STARTUP exit. Since BBR does
not use snd_ssthresh, this fix has no impact on BBR's behavior.
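
Sketch of where this happens, upon the STARTUP->DRAIN transition:

	if (bbr->mode == BBR_STARTUP && bbr_full_bw_reached(sk)) {
		bbr->mode = BBR_DRAIN;	/* drain queue we created */
		tcp_sk(sk)->snd_ssthresh =
				bbr_target_cwnd(sk, bbr_max_bw(sk), BBR_UNIT);
	}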

Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:35 +07:00
Eric Dumazet
b560f12f84
tcp_bbr: remove bbr->tso_segs_goal
Its value is computed and then immediately used;
there is no need to store it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:35 +07:00
Eric Dumazet
93816e484a
tcp_bbr: better deal with suboptimal GSO (II)
This is the second part of dealing with suboptimal device gso parameters.
In the first patch (350c9f484bde "tcp_bbr: better deal with suboptimal GSO")
we dealt with devices having a low gso_max_segs.

Some devices lower gso_max_size from 64KB to 16KB (r8152 is an example).

In order to probe an optimal cwnd, we want BBR not to be sensitive
to whatever GSO constraints a device may have.

This patch removes the tso_segs_goal() CC callback in favor of
min_tso_segs() for CCs wanting to override sysctl_tcp_min_tso_segs.
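
Sketch of the replacement callback in BBR:

	static u32 bbr_min_tso_segs(struct sock *sk)
	{
		return sk->sk_pacing_rate < (bbr_min_tso_rate >> 3) ? 1 : 2;
	}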

The next patch will remove bbr->tso_segs_goal since it does not have
to be persistent.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:35 +07:00
Yuchung Cheng
4c8f616cdb
tcp: avoid min RTT bloat by skipping RTT from delayed-ACK in BBR
A persistent connection may send a tiny amount of data (e.g. health-checks)
for a long period of time. BBR's windowed min RTT filter may only see
RTT samples from delayed ACKs, causing BBR to grossly over-estimate
the path delay depending on how much the ACK was delayed at the receiver.

This patch skips RTT samples that are likely coming from delayed ACKs. Note
that it is possible the sender never obtains a valid measure to set the
min RTT. In this case BBR will continue to set cwnd to the initial window,
which seems fine because the connection is a thin stream.
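
Sketch of the gating in bbr_update_min_rtt() (per the upstream patch):

	filter_expired = after(tcp_jiffies32,
			       bbr->min_rtt_stamp + bbr_min_rtt_win_sec * HZ);
	if (rs->rtt_us >= 0 &&
	    (rs->rtt_us <= bbr->min_rtt_us ||
	     (filter_expired && !rs->is_ack_delayed))) {
		bbr->min_rtt_us = rs->rtt_us;
		bbr->min_rtt_stamp = tcp_jiffies32;
	}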

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:34 +07:00
Yuchung Cheng
ffe0e8ab27
tcp: avoid min-RTT overestimation from delayed ACKs
This patch avoids having the TCP sender or congestion control
overestimate the min RTT by orders of magnitude. This happens when
all the samples in the windowed filter are one-packet transfers,
like small requests and health-check-like chit-chat, which is fairly
common for applications using persistent connections. This patch
tries to conservatively label and skip RTT samples obtained from
this type of workload.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:34 +07:00
zhong jiang
80f86573cb
ipv6: remove redundant null pointer check before kfree_skb
kfree_skb has taken the null pointer into account. Hence it is safe
to remove the redundant null pointer check before kfree_skb.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jesse Chan <jc@linux.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:34 +07:00
zhong jiang
eb686bd100
ipv4: remove redundant null pointer check before kfree_skb
kfree_skb has taken the null pointer into account. Hence it is safe
to remove the redundant null pointer check before kfree_skb.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jesse Chan <jc@linux.com>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:34 +07:00
Danny Lin
d94ce42d94
tcp: Enable ECN negotiation by default
This is now the default for all connections in iOS 11+, and we have
a fallback to detect and disable ECN for broken flows.

Signed-off-by: Danny Lin <danny@kdrag0n.dev>
Signed-off-by: azrim <mirzaspc@gmail.com>
2022-04-06 13:18:33 +07:00