This new expression uses the nf_dup engine to clone packets to a given gateway.
Unlike xt_TEE, we use an index to indicate output interface which should be
fine at this stage.
Moreover, change to the preemtion-safe this_cpu_read(nf_skb_duplicated) from
nf_dup_ipv{4,6} to silence a lockdep splat.
Based on the original tee expression from Arturo Borrero Gonzalez, although
this patch has diverted quite a bit from this initial effort due to the
change to support maps.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Extracted from the xtables TEE target. This creates two new modules for IPv4
and IPv6 that are shared between the TEE target and the new nf_tables dup
expressions.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch converts the existing seqlock to per-cpu counters.
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Suggested-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The attribute size wasn't accounted for in the get_slave_size() callback
(br_port_get_slave_size) when it was introduced, so fix it now. Also add
a policy entry for it in br_port_policy.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: 842a9ae08a25 ("bridge: Extend Proxy ARP design to allow optional rules for Wi-Fi")
Signed-off-by: David S. Miller <davem@davemloft.net>
The attribute size wasn't accounted for in the get_slave_size() callback
(br_port_get_slave_size) when it was introduced, so fix it now. Also add
a policy entry for it in br_port_policy.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Fixes: 958501163ddd ("bridge: Add support for IEEE 802.11 Proxy ARP")
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 1fbe4b46caca "net: pktgen: kill the Wait for kthread_stop
code in pktgen_thread_worker()" removed (in particular) the final
__set_current_state(TASK_RUNNING) and I didn't notice the previous
set_current_state(TASK_INTERRUPTIBLE). This triggers the warning
in __might_sleep() after return.
Afaics, we can simply remove both set_current_state()'s, and we
could do this a long ago right after ef87979c273a2 "pktgen: better
scheduler friendliness" which changed pktgen_thread_worker() to
use wait_event_interruptible_timeout().
Reported-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds null dev check for the 'cfg->rc_via_table ==
NEIGH_LINK_TABLE or dev_get_by_index() failed' case
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We recently changed this code from returning NULL to returning ERR_PTR.
There are some left over NULL assignments which we can remove. We can
preserve the error code from ip_route_output() instead of always
returning -ENODEV. Also these functions use a mix of gotos and direct
returns. There is no cleanup necessary so I changed the gotos to
direct returns.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Acked-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The commit 738ac1ebb96d02e0d23bc320302a6ea94c612dec ("net: Clone
skb before setting peeked flag") introduced a use-after-free bug
in skb_recv_datagram. This is because skb_set_peeked may create
a new skb and free the existing one. As it stands the caller will
continue to use the old freed skb.
This patch fixes it by making skb_set_peeked return the new skb
(or the old one if unchanged).
Fixes: 738ac1ebb96d ("net: Clone skb before setting peeked flag")
Reported-by: Brenden Blanco <bblanco@plumgrid.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Tested-by: Brenden Blanco <bblanco@plumgrid.com>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes how MGMT_EV_NEW_LONG_TERM_KEY event is build. Right now
val vield is filled with only 1 byte, instead of whole value. This bug
was introduced in
commit 1fc62c526a57 ("Bluetooth: Fix exposing full value of shortened LTKs")
Before that patch, if you paired with device using bluetoothd using simple
pairing, and then restarted bluetoothd, you would be able to re-connect,
but device would fail to establish encryption and would terminate
connection. After this patch connecting after bluetoothd restart works
fine.
Signed-off-by: Jakub Pawlowski <jpawlowski@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The flags were ignored for this function when it was introduced. Also
fix the style problem in kzalloc.
Fixes: 0838aa7fc (netfilter: fix netns dependencies with conntrack
templates)
Signed-off-by: Joe Stringer <joestringer@nicira.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next, they are:
1) A couple of cleanups for the netfilter core hook from Eric Biederman.
2) Net namespace hook registration, also from Eric. This adds a dependency with
the rtnl_lock. This should be fine by now but we have to keep an eye on this
because if we ever get the per-subsys nfnl_lock before rtnl we have may
problems in the future. But we have room to remove this in the future by
propagating the complexity to the clients, by registering hooks for the init
netns functions.
3) Update nf_tables to use the new net namespace hook infrastructure, also from
Eric.
4) Three patches to refine and to address problems from the new net namespace
hook infrastructure.
5) Switch to alternate jumpstack in xtables iff the packet is reentering. This
only applies to a very special case, the TEE target, but Eric Dumazet
reports that this is slowing down things for everyone else. So let's only
switch to the alternate jumpstack if the tee target is in used through a
static key. This batch also comes with offline precalculation of the
jumpstack based on the callchain depth. From Florian Westphal.
6) Minimal SCTP multihoming support for our conntrack helper, from Michal
Kubecek.
7) Reduce nf_bridge_info per skbuff scratchpad area to 32 bytes, from Florian
Westphal.
8) Fix several checkpatch errors in bridge netfilter, from Bernhard Thaler.
9) Get rid of useless debug message in ip6t_REJECT, from Subash Abhinov.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Without this initialization, gateways which actually announce up/down
bandwidth of 0/0 could be added. If these nodes get purged via
_batadv_purge_orig() later, the gw_node structure does not get removed
since batadv_gw_node_delete() updates the gw_node with up/down
bandwidth of 0/0, and the updating function then discards the change
and does not free gw_node.
This results in leaking the gw_node structures, which references other
structures: gw_node -> orig_node -> orig_node_ifinfo -> hardif. When
removing the interface later, the open reference on the hardif may cause
hangs with the infamous "unregister_netdevice: waiting for mesh1 to
become free. Usage count = 1" message.
Signed-off-by: Simon Wunderlich <simon@open-mesh.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
The tt_local_entry deletion performed in batadv_tt_local_remove() was neither
protecting against simultaneous deletes nor checking whether the element was
still part of the list before calling hlist_del_rcu().
Replacing the hlist_del_rcu() call with batadv_hash_remove() provides adequate
protection via hash spinlocks as well as an is-element-still-in-hash check to
avoid 'blind' hash removal.
Fixes: 068ee6e204e1 ("batman-adv: roaming handling mechanism redesign")
Reported-by: alfonsname@web.de
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
batadv_softif_vlan_get() may return NULL which has to be verified
by the caller.
Fixes: 35df3b298fc8 ("batman-adv: fix TT VLAN inconsistency on VLAN re-add")
Reported-by: Ryan Thompson <ryan@eero.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
When a node running DAT receives an ARP request from the LAN for the
first time, it is likely that this node will request the ARP entry
through the distributed ARP table (DAT) in the mesh.
Once a DAT reply is received the asking node must check if the MAC
address for which the IP address has been asked is local. If it is, the
node must drop the ARP reply bceause the client should have replied on
its own locally.
Forwarding this reply means fooling any L2 bridge (e.g. Ethernet
switches) lying between the batman-adv node and the LAN. This happens
because the L2 bridge will think that the client sending the ARP reply
lies somewhere in the mesh, while this node is sitting in the same LAN.
Reported-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Make it similar to reject_tg() in ipt_REJECT.
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
In multiple locations there are checks for whether the label in hand
is a reserved label or not using the arbritray value of 16. Factor
this out into a #define for better maintainability and for
documentation.
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
lwtunnel encap is applied for forwarded packets, but not for
locally-generated packets. This is because the output function is not
overridden in __mkroute_output, unlike it is in __mkroute_input.
The lwtunnel state is correctly set on the rth through the call to
rt_set_nexthop, so all that needs to be done is to override the dst
output function to be lwtunnel_output if there is lwtunnel state
present and it requires output redirection.
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the locally-generated packet path skb->protocol may not be set and
this is required for the lwtunnel encap in order to get the lwtstate.
This would otherwise have been set by ip_output or ip6_output so set
skb->protocol prior to calling the lwtunnel encap
function. Additionally set skb->dev in case it is needed further down
the transmit path.
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Multicast dst are not cached. They carry DST_NOCACHE.
As mentioned in commit f8864972126899 ("ipv4: fix dst race in
sk_dst_get()"), these dst need special care before caching them
into a socket.
Caching them is allowed only if their refcnt was not 0, ie we
must use atomic_inc_not_zero()
Also, we must use READ_ONCE() to fetch sk->sk_rx_dst, as mentioned
in commit d0c294c53a771 ("tcp: prevent fetching dst twice in early demux
code")
Fixes: 421b3885bf6d ("udp: ipv4: Add udp early demux")
Tested-by: Gregory Hoggarth <Gregory.Hoggarth@alliedtelesis.co.nz>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Gregory Hoggarth <Gregory.Hoggarth@alliedtelesis.co.nz>
Reported-by: Alex Gartrell <agartrell@fb.com>
Cc: Michal Kubeček <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of trying to access br->vlan_enabled directly use the provided
helper br_vlan_enabled().
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the introduction of the BPF action in d23b8ad8ab23 ("tc: add BPF
based action"), late binding was not working as expected. I.e. setting
the action part for a classifier only via 'bpf index <num>', where <num>
is the index of an existing action, is being rejected by the kernel due
to other missing parameters.
It doesn't make sense to require these parameters such as BPF opcodes
etc, as they are not going to be used anyway: in this case, they're just
allocated/parsed and then freed again w/o doing anything meaningful.
Instead, parse and verify the remaining parameters *after* the test on
tcf_hash_check(), when we really know that we're dealing with creation
of a new action or replacement of an existing one and where late binding
is thus irrelevant.
After patch, test case is now working:
FOO="1,6 0 0 4294967295,"
tc actions add action bpf bytecode "$FOO"
tc filter add dev foo parent 1: bpf bytecode "$FOO" flowid 1:1 action bpf index 1
tc actions show action bpf
action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe
index 1 ref 2 bind 1
tc filter show dev foo
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 flowid 1:1 bytecode '1,6 0 0 4294967295'
action order 1: bpf bytecode '1,6 0 0 4294967295' default-action pipe
index 1 ref 2 bind 1
Late binding of a BPF action can be useful for preloading maps (e.g. before
they hit traffic) in case of eBPF programs, or to share a single eBPF action
with multiple classifiers.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before this patch when a vid was not specified, the entry was added with
vid 0 which is useless when vlan_filtering is enabled. This patch makes
the entry to be added on all configured vlans when vlan filtering is
enabled and respectively deleted from all, if the entry vid is 0.
This is also closer to the way fdb works with regard to vid 0 and vlan
filtering.
Example:
Setup:
$ bridge vlan add vid 256 dev eth4
$ bridge vlan add vid 1024 dev eth4
$ bridge vlan add vid 64 dev eth3
$ bridge vlan add vid 128 dev eth3
$ bridge vlan
port vlan ids
eth3 1 PVID Egress Untagged
64
128
eth4 1 PVID Egress Untagged
256
1024
$ echo 1 > /sys/class/net/br0/bridge/vlan_filtering
Before:
$ bridge mdb add dev br0 port eth3 grp 239.0.0.1
$ bridge mdb
dev br0 port eth3 grp 239.0.0.1 temp
After:
$ bridge mdb add dev br0 port eth3 grp 239.0.0.1
$ bridge mdb
dev br0 port eth3 grp 239.0.0.1 temp vid 1
dev br0 port eth3 grp 239.0.0.1 temp vid 128
dev br0 port eth3 grp 239.0.0.1 temp vid 64
Signed-off-by: Satish Ashok <sashok@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
"len" is a signed integer. We check that len is not negative, so it
goes from zero to INT_MAX. PAGE_SIZE is unsigned long so the comparison
is type promoted to unsigned long. ULONG_MAX - 4095 is a higher than
INT_MAX so the condition can never be true.
I don't know if this is harmful but it seems safe to limit "len" to
INT_MAX - 4095.
Fixes: a8c879a7ee98 ('RDS: Info and stats')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Bridge devices don't need to segment multiple tagged packets since thier
ports can segment them.
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we share an action within a filter, the bind refcnt
should increase, therefore we should not call tcf_hash_release().
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
openvswitch modifies the L4 checksum of a packet when modifying
the ip address. When an IP packet is fragmented only the first
fragment contains an L4 header and checksum. Prior to this change
openvswitch would modify all fragments, modifying application data
in non-first fragments, causing checksum failures in the
reassembled packet.
Signed-off-by: Glenn Griffin <ggriffin.kernel@gmail.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add skb->hash to the __sk_buff offset map, so it can be accessed from
an eBPF program. We currently already do this for classic BPF filters,
but not yet on eBPF, it might be useful as a demuxer in combination with
helpers like bpf_clone_redirect(), toy example:
__section("cls-lb") int ingress_main(struct __sk_buff *skb)
{
unsigned int which = 3 + (skb->hash & 7);
/* bpf_skb_store_bytes(skb, ...); */
/* bpf_l{3,4}_csum_replace(skb, ...); */
bpf_clone_redirect(skb, which, 0);
return -1;
}
I was thinking whether to add skb_get_hash(), but then concluded the
raw skb->hash seems fine in this case: we can directly access the hash
w/o extra eBPF helper function call, it's filled out by many NICs on
ingress, and in case the entropy level would not be sufficient, people
can still implement their own specific sw fallback hash mix anyway.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex reported the following crash when using fq_codel
with htb:
crash> bt
PID: 630839 TASK: ffff8823c990d280 CPU: 14 COMMAND: "tc"
[... snip ...]
#8 [ffff8820ceec17a0] page_fault at ffffffff8160a8c2
[exception RIP: htb_qlen_notify+24]
RIP: ffffffffa0841718 RSP: ffff8820ceec1858 RFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88241747b400
RDX: ffff88241747b408 RSI: 0000000000000000 RDI: ffff8811fb27d000
RBP: ffff8820ceec1868 R8: ffff88120cdeff24 R9: ffff88120cdeff30
R10: 0000000000000bd4 R11: ffffffffa0840919 R12: ffffffffa0843340
R13: 0000000000000000 R14: 0000000000000001 R15: ffff8808dae5c2e8
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#9 [...] qdisc_tree_decrease_qlen at ffffffff81565375
#10 [...] fq_codel_dequeue at ffffffffa084e0a0 [sch_fq_codel]
#11 [...] fq_codel_reset at ffffffffa084e2f8 [sch_fq_codel]
#12 [...] qdisc_destroy at ffffffff81560d2d
#13 [...] htb_destroy_class at ffffffffa08408f8 [sch_htb]
#14 [...] htb_put at ffffffffa084095c [sch_htb]
#15 [...] tc_ctl_tclass at ffffffff815645a3
#16 [...] rtnetlink_rcv_msg at ffffffff81552cb0
[... snip ...]
As Jamal pointed out, there is actually no need to call dequeue
to purge the queued skb's in reset, data structures can be just
reset explicitly. Therefore, we reset everything except config's
and stats, so that we would have a fresh start after device flipping.
Fixes: 4b549a2ef4be ("fq_codel: Fair Queue Codel AQM")
Reported-by: Alex Gartrell <agartrell@fb.com>
Cc: Alex Gartrell <agartrell@fb.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
[xiyou.wangcong@gmail.com: added codel_vars_init() and qdisc_qstats_backlog_dec()]
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
arch/s390/net/bpf_jit_comp.c
drivers/net/ethernet/ti/netcp_ethss.c
net/bridge/br_multicast.c
net/ipv4/ip_fragment.c
All four conflicts were cases of simple overlapping
changes.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
1) Must teardown SR-IOV before unregistering netdev in igb driver, from
Alex Williamson.
2) Fix ipv6 route unreachable crash in IPVS, from Alex Gartrell.
3) Default route selection in ipv4 should take the prefix length, table
ID, and TOS into account, from Julian Anastasov.
4) sch_plug must have a reset method in order to purge all buffered
packets when the qdisc is reset, likewise for sch_choke, from WANG
Cong.
5) Fix deadlock and races in slave_changelink/br_setport in bridging.
From Nikolay Aleksandrov.
6) mlx4 bug fixes (wrong index in port even propagation to VFs,
overzealous BUG_ON assertion, etc.) from Ido Shamay, Jack
Morgenstein, and Or Gerlitz.
7) Turn off klog message about SCTP userspace interface compat that
makes no sense at all, from Daniel Borkmann.
8) Fix unbounded restarts of inet frag eviction process, causing NMI
watchdog soft lockup messages, from Florian Westphal.
9) Suspend/resume fixes for r8152 from Hayes Wang.
10) Fix busy loop when MSG_WAITALL|MSG_PEEK is used in TCP recv, from
Sabrina Dubroca.
11) Fix performance regression when removing a lot of routes from the
ipv4 routing tables, from Alexander Duyck.
12) Fix device leak in AF_PACKET, from Lars Westerhoff.
13) AF_PACKET also has a header length comparison bug due to signedness,
from Alexander Drozdov.
14) Fix bug in EBPF tail call generation on x86, from Daniel Borkmann.
15) Memory leaks, TSO stats, watchdog timeout and other fixes to
thunderx driver from Sunil Goutham and Thanneeru Srinivasulu.
16) act_bpf can leak memory when replacing programs, from Daniel
Borkmann.
17) WOL packet fixes in gianfar driver, from Claudiu Manoil.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (79 commits)
stmmac: fix missing MODULE_LICENSE in stmmac_platform
gianfar: Enable device wakeup when appropriate
gianfar: Fix suspend/resume for wol magic packet
gianfar: Fix warning when CONFIG_PM off
act_pedit: check binding before calling tcf_hash_release()
net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket
net: sched: fix refcount imbalance in actions
r8152: reset device when tx timeout
r8152: add pre_reset and post_reset
qlcnic: Fix corruption while copying
act_bpf: fix memory leaks when replacing bpf programs
net: thunderx: Fix for crash while BGX teardown
net: thunderx: Add PCI driver shutdown routine
net: thunderx: Fix crash when changing rss with mutliple traffic flows
net: thunderx: Set watchdog timeout value
net: thunderx: Wakeup TXQ only if CQE_TX are processed
net: thunderx: Suppress alloc_pages() failure warnings
net: thunderx: Fix TSO packet statistic
net: thunderx: Fix memory leak when changing queue count
net: thunderx: Fix RQ_DROP miscalculation
...
Per RFC6437 stateful flow labels (e.g. labels set by flow label manager)
cannot "disturb" nodes taking part in stateless flow labels. While the
ranges only reduce the flow label entropy by one bit, it is conceivable
that this might bias the algorithm on some routers causing a load
imbalance. For best results on the Internet we really need the full
20 bits.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change the meaning of net.ipv6.auto_flowlabels to provide a mode for
automatic flow labels generation. There are four modes:
0: flow labels are disabled
1: flow labels are enabled, sockets can opt-out
2: flow labels are allowed, sockets can opt-in
3: flow labels are enabled and enforced, no opt-out for sockets
np->autoflowlabel is initialized according to the sysctl value.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can't call skb_get_hash here since the packet is not complete to do
flow_dissector. Create hash based on flowi6 instead.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add skb_get_hash_flowi6 and skb_get_hash_flowi4 which derive an sk_buff
hash from flowi6 and flowi4 structures respectively. These functions
can be called when creating a packet in the output path where the new
sk_buff does not yet contain a fully formed packet that is parsable by
flow dissector.
Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for using DSA slave network devices with netconsole, which
requires us to allocate and free custom netpoll instances and invoke the
parent network device poll controller callback.
In order for netconsole to work, we need to construct the DSA tag, but
not queue the skb for transmission on the master network device xmit
function.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
All tagging protocols do the same thing: increment device statistics,
make room for the tag to be inserted, create the tag, invoke the parent
network device transmit function.
In order to prepare for adding netpoll support, which requires the tag
creation, but not using the parent network device transmit function, do
some little refactoring which eliminates duplication between the 4
tagging protocols supported.
We need to return a sk_buff pointer back to the caller because the tag
specific transmit function may have to reallocate the original skb (e.g:
tag_trailer.c) and this is the one we should be transmitting, not the
original sk_buff we were passed.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When we share an action within a filter, the bind refcnt
should increase, therefore we should not call tcf_hash_release().
Fixes: 1a29321ed045 ("net_sched: act: Dont increment refcnt on replace")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Undefined reference to ip6_route_output and ip_route_output
was reported with CONFIG_INET=n and CONFIG_IPV6=n.
This patch uses ipv6_stub_impl.ipv6_dst_lookup instead of
ip6_route_output. And wraps affected code under
IS_ENABLED(CONFIG_INET) and IS_ENABLED(CONFIG_IPV6).
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Reported-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds net argument to ipv6_stub_impl.ipv6_dst_lookup
for use cases where sk is not available (like mpls).
sk appears to be needed to get the namespace 'net' and is optional
otherwise. This patch series changes ipv6_stub_impl.ipv6_dst_lookup
to take net argument. sk remains optional.
All callers of ipv6_stub_impl.ipv6_dst_lookup have been modified
to pass net. I have modified them to use already available
'net' in the scope of the call. I can change them to
sock_net(sk) to avoid any unintended change in behaviour if sock
namespace is different. They dont seem to be from code inspection.
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce helpers to let eBPF programs attached to TC manipulate tunnel metadata:
bpf_skb_[gs]et_tunnel_key(skb, key, size, flags)
skb: pointer to skb
key: pointer to 'struct bpf_tunnel_key'
size: size of 'struct bpf_tunnel_key'
flags: room for future extensions
First eBPF program that uses these helpers will allocate per_cpu
metadata_dst structures that will be used on TX.
On RX metadata_dst is allocated by tunnel driver.
Typical usage for TX:
struct bpf_tunnel_key tkey;
... populate tkey ...
bpf_skb_set_tunnel_key(skb, &tkey, sizeof(tkey), 0);
bpf_clone_redirect(skb, vxlan_dev_ifindex, 0);
RX:
struct bpf_tunnel_key tkey = {};
bpf_skb_get_tunnel_key(skb, &tkey, sizeof(tkey), 0);
... lookup or redirect based on tkey ...
'struct bpf_tunnel_key' will be extended in the future by adding
elements to the end and the 'size' argument will indicate which fields
are populated, thereby keeping backwards compatibility.
The 'flags' argument may be used as well when the 'size' is not enough or
to indicate completely different layout of bpf_tunnel_key.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
We simplify the link creation function tipc_link_create() and the way
the link struct it is connected to the node struct. In particular, we
remove the duplicate initialization of some fields which are anyway set
in tipc_link_reset().
Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, when we extract small messages from a message bundle, or
when many messages have accumulated in the link arrival queue, those
messages are added one by one to the lock protected link input queue.
This may increase contention with the reader of that queue, in
the function tipc_sk_rcv().
This commit introduces a temporary, unprotected input queue in
tipc_link_rcv() for such cases. Only when the arrival queue has been
emptied, and the function is ready to return, does it splice the whole
temporary queue into the real input queue.
Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the most recent changes, all access calls to a link which
may entail addition of messages to the link's input queue are
postpended by an explicit call to tipc_sk_rcv(), using a reference
to the correct queue.
This means that the potentially hazardous implicit delivery, using
tipc_node_unlock() in combination with a binary flag and a cached
queue pointer, now has become redundant.
This commit removes this implicit delivery mechanism both for regular
data messages and for binding table update messages.
Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to facilitate future improvements to the locking structure, we
want to make resetting and establishing of links non-atomic. I.e., the
functions tipc_node_link_up() and tipc_node_link_down() should be called
from outside the node lock context, and grab/release the node lock
themselves. This requires that we can freeze the link state from the
moment it is set to RESETTING or PEER_RESET in one lock context until
it is set to RESET or ESTABLISHING in a later context. The recently
introduced link FSM makes this possible, so we are now ready to introduce
the above change.
This commit implements this.
Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The node lock is currently grabbed and and released in the function
tipc_disc_rcv() in the file discover.c. As a preparation for the next
commits, we need to move this node lock handling, along with the code
area it is covering, to node.c.
This commit introduces this change.
Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Until now, we have been handling link failover and synchronization
by using an additional link state variable, "exec_mode". This variable
is not independent of the link FSM state, something causing a risk of
inconsistencies, apart from the fact that it clutters the code.
The conditions are now in place to define a new link FSM that covers
all existing use cases, including failover and synchronization, and
eliminate the "exec_mode" field altogether. The FSM must also support
non-atomic resetting of links, which will be introduced later.
The new link FSM is shown below, with 7 states and 8 events.
Only events leading to state change are shown as edges.
+------------------------------------+
|RESET_EVT |
| |
| +--------------+
| +-----------------| SYNCHING |-----------------+
| |FAILURE_EVT +--------------+ PEER_RESET_EVT|
| | A | |
| | | | |
| | | | |
| | |SYNCH_ |SYNCH_ |
| | |BEGIN_EVT |END_EVT |
| | | | |
| V | V V
| +-------------+ +--------------+ +------------+
| | RESETTING |<---------| ESTABLISHED |--------->| PEER_RESET |
| +-------------+ FAILURE_ +--------------+ PEER_ +------------+
| | EVT | A RESET_EVT |
| | | | |
| | | | |
| | +--------------+ | |
| RESET_EVT| |RESET_EVT |ESTABLISH_EVT |
| | | | |
| | | | |
| V V | |
| +-------------+ +--------------+ RESET_EVT|
+--->| RESET |--------->| ESTABLISHING |<----------------+
+-------------+ PEER_ +--------------+
| A RESET_EVT |
| | |
| | |
|FAILOVER_ |FAILOVER_ |FAILOVER_
|BEGIN_EVT |END_EVT |BEGIN_EVT
| | |
V | |
+-------------+ |
| FAILINGOVER |<----------------+
+-------------+
These changes are fully backwards compatible.
Tested-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>