Speakers
Jakub Kicinski
Matthieu Baerts
Willem de Bruijn
Kuniyuki Iwashima
Jason Xing
Eric Dumazet
Ido Schimmel
Sabrina Dubroca
Daniel Borkmann
John Fastabend
Alexander Lobakin
Paolo Abeni
QUIC in kernel
Jakub Kicinski
Notes: Paolo, Sabrina, Matt
Testing:
- New addition since last netconf
- tests run on VMs with 4 vCPUs
- Multiple VMs run in parallel, can be more than the actual number of CPUs on the physical system
- Postgres DB for the results + JSON cache (to help the web UI)
- adding more tests may cause performance issues
- gathers all patches in patchwork and runs everything every 3 hours
- individual patch testing: only build + static testing
- running tests in parallel could help
- except the time is dominated by a handful of very slow selftests
- What is enabled? all debug options (KASAN, KMEMLEAK, SLAB debug)
- very slow, some results are flaky
- might be enough to check for stacktraces, use the non-debug run to look at the selftests results
- passing on slow kernels could be useful as a proxy for slow machines (embedded)
- can be quite a lot of work to have that open
- Is there value? People who currently look at the results know the tests, and know when a failure is not a false positive.
- move it out of BPF?
- too hard to run these tests
- might be good to move them to drivers/net (or hw)
- they can still be executed by the BPF CI
- Maybe they don't have a lot of value?
- Bootlin is currently working on moving old tests to "prog_tests"
YNL:
- lack of packaging: someone working on it
- lack of bindings in other languages: only C and C++ exist, and nothing else is coming.
- lack of testing?
- using YAML for syzbot coverage: there was some interest but no current work
- might be interesting to have it integrated into iproute2, because it is packaged everywhere:
- might be difficult to link dynamically with the ynl library
Private drivers (we cannot buy the HW in store):
- they might bring new APIs that could be too specific, or maybe not → they can also bring motivation for new features (e.g. flowlabel)
Matthieu Baerts
Notes: Willem
Topic: Netdev CI and NIPA
Patchwork for maintainers, not developers:
- Don’t want people to start relying on the test infra for their correctness: test internally first
- Output can be unclear: not every failure is problematic. People might send too often, fixing unimportant things.
About sending emails: Question about BPF test infra: concerns about sending too many messages to the list. Unlike kbuild, these failures are not necessarily hard blockers.
Coverage: Are there subsystems that currently have little to no coverage? We don’t have GCC coverage instrumentation. For MPTCP, Paolo had a look: most of the regular (i.e., non-error) paths were covered.
Flaky tests: are being ignored. How to get them to pass? Jakub occasionally sends a summary to the mailing list.
Reproduce locally: can use virtme-ng? Or build docker image. But may need bleeding edge versions of userspace tools, e.g., iproute2. Devs mention this in the commit message.
Is it useful to make this generally available? Yes, for reproducing issues, for developing on NIPA, and for presubmit testing before even sending patches upstream.
Funding: Allow testing before submission, may also make it easier to get funding from more companies. Nvidia already uses Intel’s infra (?). Waits for nightly results before submitting upstream. What is the overhead of setting up something like the BPF foundation? A meeting every two weeks.
How to expand to other subsystems: reuse NIPA or build from scratch? MPTCP uses GitHub Actions, non-trivial to set up. Could we have something reusable, perhaps hosted by the Linux Foundation, like patchwork? What we want is a dashboard that shows KTAP: a matrix of tests * runs.
Stable versions: kselftests should support all previous versions. Linaro LKFT already runs kselftests from HEAD against stable kernels. Skip if a feature is not supported. Can be hard to do: example is packetdrill, where subtle changes deep in the TCP stack can affect behavior of many packetdrill tests. Alternative: send selftests that accompany bug fixes to stable. The tests are being accepted to stable too. So ask for an opt-out for this test-stable-from-HEAD policy for networking.
Test output: recommendation for new tests to generate KTAP. Can use ktap_helpers.sh for new shell-based tests. Or roll your own, the format is fairly straightforward: https://docs.kernel.org/dev-tools/ktap.html . But note that NIPA and LKFT parse TAP13, from which KTAP v1 slightly diverges.
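For reference, a minimal KTAP report (test names here are made up) looks like:

    KTAP version 1
    1..3
    ok 1 setup
    not ok 2 rx_checksum
    ok 3 hw_gro # SKIP device lacks HW GRO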
Code reuse: move more code into net/lib or even under kselftests, like MPTCP code. Unclear who maintains the code under kselftests.
Willem de Bruijn
Notes: Paolo, Matthieu
Topic: Testing and scaling
Reviews:
- Use b4 diff for reviews, could be useful to integrate in patchwork or send as an email. Show diff between patch set versions.
- Matttbe: patchew does something similar, pulling patches directly from the ML (patchew.org). Less powerful than b4: it does not apply patches, it just does patch diffs
- Integration with patchwork could be problematic due to pw maintainer “latency”
Precondition checks in kernel code:
- Some functions have non-trivial preconditions on their args/input (only non-GSO skbs in some code paths, etc.). We could add precondition debug checks in the form of DEBUG_NET_WARN*. Useful, or pointless noise? E.g. add "skb_assert_nofraglist" (a minimal sketch follows this list)
- Some fields are not always initialized (e.g. mac_offset), which is hard to check.
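A minimal sketch of such a precondition helper (the name comes from the discussion, the body is an assumption): DEBUG_NET_WARN_ON_ONCE() compiles away without CONFIG_DEBUG_NET, so the check only costs something on debug builds.

    #include <linux/skbuff.h>
    #include <net/net_debug.h>

    /* Warn (once, debug builds only) if an skb carrying a frag list reaches
     * a code path that assumes no frag list. */
    static inline void skb_assert_nofraglist(const struct sk_buff *skb)
    {
            DEBUG_NET_WARN_ON_ONCE(skb_has_frag_list(skb));
    }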
Expanding kselftests: benchmarks and fault-injection:
- No perf-related selftests in CI. Adding them is complex (platform/H/W specific, possibly a lot of flakes, can have a lot of noise). A problem is also how to express a perf failure in KTAP format.
- It would be very interesting to share such tests: everybody is doing this on their side, probably with the same tests, and it is hard for someone new to start a similar project
- Paolo: Automated perf testing inside Red Hat: some flakiness, CPU usage is also compared
- tied to HW testing, perf tests need real NICs
- It can be interesting to publish the reports to increase competition and avoid regressions: DPDK does something like that, so vendors want their fixes in before a release and collaborate with other vendors
- More fault-injection functions, to increase error-path coverage.
Scaling:
- Tools such as mpstat and ethtool -S were developed when machines had fewer CPUs and NIC queues; they don’t show/cope well with large H/W. The number of queues can differ from the number of cores, but the code sometimes/somewhere does not really expect that.
- Joe: TLB flush/madvise is often a bottleneck for user-space
- Per device and per-cpu variables could use/waste a lot of memory
- We need good default configuration for rx/tx queue numbers, pinning.
- Daniel: tuned could be a starting point, but it currently does not cope with queues and IRQs
- Willem: google uses python script to generate shell boot script tailored to each platform, derived from platform-independent heuristics
- Dynamic queue allocation, useful in the container use-case. Containers care about queues (and RSS contexts) more than devices. How to expose this API to user-space?
- Dedicated polling cores
- Device independent config state
- NAPI persistent ID, i.e. idpf
A lot of work in the shared memory area.
Kuniyuki Iwashima
Notes: Sabrina, Paolo
Topic: Per Netns RTNL
- some dumpit functions have been converted to lockless access
- some doit functions converted to lockless are also "get" functions
- creating lots of netns + some netdevices in each ns is very slow
- Solution: a per netns lock, but replacing all the rtnl_*lock calls is painful
- start by nesting the per-net lock under rtnl_lock, then remove the old lock when conversion is complete
- some doit handlers operate on multiple netns at once
- lock order helper to lock the pair of netns: init_net first, then compare addresses
- Add helpers to simplify locking multiple netns respecting that order (see the sketch at the end of this section)
- Eric: make rtnl_net_lock take rtnl_lock during the conversion phase, to avoid having to rename then remove rtnl_lock to _deprecated:
- In some places, it is needed to have some code between the _deprecated and the net one
- But not a blocking issue: we can have a special __rtnl_lock… that could cover these special cases
- Note: it looks like it is not needed to introduce the _deprecated name:
- The current rtnl_lock is the deprecated one
- If the function name is changed but the code behind it is the same, that will make backports to stable versions harder to do
- We need to ensure the ops will stay alive around the netns lock/unlock; a get helper acquires the netns refcount and adds the lock to the list
- Eric: rtnl_link_ops conversion: use SRCU?
- veth can have a peer, and macvlan (and other) have a child device. peer/child can be in a different netns.
- Jakub: the peer can be read from the core (there’s an ndo)
- macvlan: upper devices/ports can be in different netns
- unregister per netns: add devices to per-net list, then per-net unreg work
- Open questions on setlink/delink
- Jakub: locking code vs locking specific data structures
- Paolo: per-netdevice lock (initially small, but could cover more properties)
- Eric: won't help with spawning 10k netns
- Eric: how would for_each_net() work with per-netns RTNL? and notifiers?
- Eric: what about netns deletion?
- Kuniyuki: not the focus for now
- Eric: but it's a pain point for google
- cleanup_net vs module load, potential deadlocks
- Eric: large set of changes, and during the conversion phase everything will become slower
- or per-netns lock can be a noop until rtnl_lock gets removed
- DEBUG kernels have the lock so we can use LOCKDEP to detect lock ordering issues
- Jakub: can we pull some operations out of rtnl_lock?
- Eric: large pcpu allocs (SNMP) during netns initialization are slow
- Eric: prefer mutex to spinlock in any context we can sleep (even for small sections)
- and there are lockless lists as well
- Eric: pcpu allocator uses a spinlock
- Paolo: sysfs interfaces use rtnl_trylock, which makes the contention worse
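A minimal sketch of the pair-locking order discussed above, assuming a per-netns rtnl_net_lock() as proposed (the pair helper below and its exact form are assumptions): lock init_net first, otherwise order by pointer address, so two tasks locking the same pair of netns cannot deadlock.

    #include <linux/minmax.h>
    #include <net/net_namespace.h>

    static void rtnl_net_lock_pair(struct net *net_a, struct net *net_b)
    {
            /* init_net always goes first, otherwise lowest address first */
            if (net_b == &init_net || (net_a != &init_net && net_b < net_a))
                    swap(net_a, net_b);

            rtnl_net_lock(net_a);           /* assumed per-netns RTNL lock */
            if (net_a != net_b)
                    rtnl_net_lock(net_b);
    }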
Jason Xing
Topic: Extending SO_TIMESTAMPING feature
- useful to detect latency
- Some measurable overhead, and possibly changes required to existing applications
- Use a bpf program to do the required setsockopt without app changes (the existing per-socket API is sketched after this list)
- Use kprobes to fetch the timestamp without additional syscalls
- Since no recvmsg() consumes the timestamps, the modification will hit the sk_rmem_alloc limit at some point
- Possible alternative: use tracepoints
- Quite noisy, will trace every application
- A new setsockopt could be used to set a per socket flag to enable the tracepoints
- It’s tricky to look at the skb payload, if needed
- Can’t enable both per-application timestamping (via the existing API) and this ‘admin’ timestamping simultaneously.
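For context, a minimal sketch of the existing per-socket API that the BPF-based proposal would enable on the application’s behalf (the flag combination is just an example):

    #include <linux/net_tstamp.h>
    #include <sys/socket.h>

    static int enable_tx_timestamping(int fd)
    {
            unsigned int flags = SOF_TIMESTAMPING_TX_SCHED |
                                 SOF_TIMESTAMPING_TX_SOFTWARE |
                                 SOF_TIMESTAMPING_SOFTWARE |
                                 SOF_TIMESTAMPING_OPT_ID |
                                 SOF_TIMESTAMPING_OPT_TSONLY;

            /* Timestamps are read back from the socket error queue via
             * recvmsg(fd, ..., MSG_ERRQUEUE) as SCM_TIMESTAMPING cmsgs,
             * which is the extra syscall the proposal wants to avoid. */
            return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING,
                              &flags, sizeof(flags));
    }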
Willem/John: use a bpf cgroup hook to attach a bpf program to collect aggregate stats/histograms (we do not want to push the full tx stream to userspace).
The Google use case is to track (by sampling) the lifecycle, latency, etc. of RPCs across different hosts
Jakub: we need an end-to-end implementation, from packet transmission to actual timestamp collection and usage.
Eric: the timestamping APIs are inherently racy because it touches socket fields and can’t acquire the socket lock
Eric Dumazet
Topic: UDP, TCP
UDP
- ISC: security flaw: spoofed source address equal to the host's own address
- neigh code drops the packet
- server gets blocked when the queue gets full
- router should be blocking those incoming packets with RPF
TCP socket lock
- regression in tcp_v6_rcv: spending more time acquiring the lock
- TX takes the spinlock only briefly to set owned_by_user; it is RX and the timers that really take the spinlock
- timers don't get canceled, they run and the handler acquires the lock to check if there's anything to do
- things done under lock (needlessly, or much longer than needed)
- waking up a process (which is expensive)
- handling ACKs (kfree_skb causes contention on kmem_cache)
- some state shouldn't be in the socket (and protected by the socket lock)
- maybe percpu (rt guys won't like this) or per-thread data
- delay wakeups and kfree_skb until after unlock (see the sketch after this list)
- prepare skb reply, and after spin_unlock send it
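A rough sketch of the deferral pattern above (function and variable names are illustrative, not the actual TCP code): collect skbs while holding the socket spinlock, then free them, and do any wakeups, only after the lock is released.

    #include <linux/skbuff.h>
    #include <net/sock.h>

    static void example_ack_cleanup(struct sock *sk, struct sk_buff *acked)
    {
            struct sk_buff_head to_free;

            __skb_queue_head_init(&to_free);

            bh_lock_sock(sk);
            /* ... ACK processing: instead of kfree_skb() under the lock,
             * move the acked skbs onto a local list ... */
            __skb_queue_tail(&to_free, acked);
            bh_unlock_sock(sk);

            /* kmem_cache contention (and wakeups) now happen unlocked */
            __skb_queue_purge(&to_free);
    }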
BIG TCP
- Forward packets from host to VMs that may not support BIG TCP
- Add a flag to virtio to negotiate
- the virtio-net header has 16-bit length fields for each packet, so this would require a spec change (the current header is shown below)
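For reference, the current header from include/uapi/linux/virtio_net.h; the per-packet size fields are all 16-bit, which is where the limitation mentioned above lives:

    struct virtio_net_hdr {
            __u8 flags;
            __u8 gso_type;
            __virtio16 hdr_len;     /* Ethernet + IP + tcp/udp hdrs */
            __virtio16 gso_size;    /* Bytes to append to hdr_len per frame */
            __virtio16 csum_start;
            __virtio16 csum_offset;
    };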
syzbot
- not a lot of dupes anymore, mostly good reports, very useful
- Eric triages so only networking bugs get reported to netdev
Jumbo and beyond?
- 4k was good for zerocopy
- ACKs don’t get sent immediately if packets are smaller than the MTU; some workloads don’t like larger MTUs
Ido Schimmel
Topic: DSCP matching, IPMR extension, XDP metadata for telemetry
Multipath hash seed:
- Depending on the use case, the admin may want the multipath hash to be different or equal on different hosts: hence the need to set the hash seed from user-space
Transceiver module firmware update:
DSCP matching:
- TOS is a minefield, with overlapping and contradictory bit allocations over the years, and the kernel still uses very old macros to touch it.
- Needed to refactor the masking in the core and expose a DSCP selector to the FIB engine
XDP metadata for telemetry:
- psample can sample packets at the tc level, passing some additional metadata to user-space via netlink
- XDP could speed up the process, but will need some additional metadata
Drop reasons
- generic (reusable) vs specific (easier to understand the problem); see the annotation example after this list
- Paolo: too specific is not useful, since we already know the callsite
- we have location + stack trace
- Eric: check the compile time if we add lots of drop reasons, since that file is included everywhere
- do we need to annotate everything, or leave some unspecified?
- renames will break users, but they’re not considered uapi
- Jakub: renumbering is fine. rename/remove: try and see if people complain
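For context, a small made-up example of the existing annotation API: the reason enum shows up in the kfree_skb tracepoint alongside the callsite/stack trace.

    #include <linux/skbuff.h>
    #include <linux/udp.h>

    static void example_drop(struct sk_buff *skb)
    {
            if (!pskb_may_pull(skb, sizeof(struct udphdr))) {
                    /* specific reason instead of a bare kfree_skb() */
                    kfree_skb_reason(skb, SKB_DROP_REASON_PKT_TOO_SMALL);
                    return;
            }
            consume_skb(skb);
    }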
Sabrina Dubroca
Topic: macsec offload, semi-automated validation for uapi patches
macsec offload:
- The core needs to know the packet number (sequence chosen by the HW) to do the rekeying; there is no generic way for the driver to tell it to the core
- Some drivers don’t update this packet number
- wpa_supplicant does the rekey based on the interface stats
- Eric: use counters (from the stats) / estimations if the driver doesn’t provide anything
- Jakub: write a test for HW: new implementations have to pass that test
- Two ways for the offload: PHY and MAC:
- Userspace has to choose one, no (current) way to know what’s available
- ANY: kernel chooses / tries: but that will only work for newer kernels
- Or the userspace does that because they have to do that with older kernels anyway: try one, then the other one
- Old drivers do not provide metadata: can’t tell the core if the rx packet is eth or h/w decrypted macsec, nor the macsec channels
- The core workaround broke isolation for bcast packets
- Require new drivers to support the metadata, e.g. via a selftest that has to pass
- What about old drivers that don’t/cannot support that?
- Eric: disable this offload feature on old drivers, but have a way (static key) for the user to enable it: a way to warn the user it is not secure
- Paolo: probably best to talk to the vendors about that, maybe they can fix the drivers, if it is possible with the HW
semi-automated validation for UAPI patches:
- It is “easy” to break the UAPI and miss that in review, e.g. new entries in the middle of a struct, changing the size of a field, etc. (see the illustration after this list)
- Tools can help detect the problems by looking for struct layout or define changes; the detection heuristics can be prone to false positives or negatives, so we need to draw a line on what is acceptable
- Might be enough to generate a pahole diff and add a link on Patchwork
- + a warning (similar to checkpatch, e.g. when adding a new file) when the UAPI is modified, and reviewers can check that.
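An illustration (made-up struct) of the kind of silent break such tooling should flag: inserting a field in the middle of a UAPI struct shifts the offset of every later field, while old binaries keep using the old layout; a pahole-style diff makes this visible during review.

    struct foo_uapi {
            __u32 flags;
            __u32 new_field;  /* BAD: inserted mid-struct, moves 'value' from offset 4 to 8 */
            __u32 value;      /* old userspace still reads this at offset 4 */
    };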
Daniel Borkmann
Topic: Overlapping TProxy with non-local bind + Wireguard + BIG TCP
Overlapping TProxy with non-local bind:
- The proposal looks good to Eric and Kuniyuki
Benchmarking over the wire for wireguard: the CoW of the data on decrypt looks very expensive in the flamegraph:
- especially visible on AMD cpus?
- skb_cow_data() is quite heavy, but mostly because of the clear_page_rep() being done:
- Eric: why is this page zeroed? Maybe a security feature forcing that? Check the config option (INIT_ON_ALLOC_DEFAULT_ON)
Big TCP:
- Eric: best to have a fast path: only do the v4 check if the previous one failed
- Eric: if the cacheline is an issue, the field can also be moved (some other fields will be moved later on as well)
John Fastabend
Topic: DPU use cases for Cilium, proposed programming API
DPUs have a bunch of general purpose CPUs used to offer different services (ipsec, tls, ...)
Exposed in vendor-specific ways (SDKs)
Frequent use-case: a custom header above UDP with a sequence number and a token. Such a proto/header is not going to be added to the kernel; the need is to filter / do admission control based on custom header fields.
The usual workflow is: write the parser in a DSL, configure the device, compile the parser for the (device-specific) backend, upload and reboot.
Want to avoid another SW layer in the kernel to emulate the HW.
Want to build:
- Load balancers (data traffic does not reach the host)
- Firewalls
- RSS
- Crypto services
The problem is how to express TCAM entry manipulation, e.g. P4, OVS, or something similar
Willem: what about updates: is the interface going to be a bottleneck? Perhaps a direct mapping of the CAM could/should be needed?
Willem: the DPU could be modeled as a switch with multiple ports (host side/wire side)
You can use the free cores on DPUs for more funky stuff like TCP reordering, but we don’t need kernel interfaces for that kind of thing - it’s like s/w running on a different host.
Alexander Lobakin
Topic: netdevice related
- netdev_from_priv() to reduce overhead and boilerplate in drivers:
- Drop the back-pointer from the priv data to the net_device and use a generic helper instead, as the core struct is always at a fixed offset WRT the private data (a minimal sketch is at the end of this section).
- Will save a little data and code per driver.
- Seems fair
- synchronizing ETH_SS_STATS with changing the number of queues (via timers):
- Per-queue stats require 3 calls, and the number of queues has to stay consistent in between, or bad things will happen
- But we may move away from ethtool stats per queue, so this problem should go away for new drivers
- Perhaps add some documentation for the above
- If we move some netdev flags to priv, we will need inheriting and incrementing private flags (just like we do for the features), candidates:
- GRO in cpu maps; GRO and NAPI do not necessarily go together, the idea would be to decouple the GRO helpers from NAPI so that any entity other than NAPI could do GRO too, e.g. CPU maps, and perhaps later GRO cells
- Eric: moving the GRO bits out of napi could change the napi struct layout, must be careful to avoid breaking cacheline-based optimizations
- HSR consumes 4 bits, very little usage for that
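A minimal sketch of the proposed helper (the name comes from the discussion, the body is an assumption): it simply inverts netdev_priv(), which places the driver private area at a fixed, aligned offset after struct net_device.

    #include <linux/netdevice.h>

    static inline struct net_device *netdev_from_priv(void *priv)
    {
            /* mirror of netdev_priv(): the private area sits right after the
             * aligned struct net_device, so the reverse mapping is a constant */
            return (struct net_device *)((char *)priv -
                                         ALIGN(sizeof(struct net_device),
                                               NETDEV_ALIGN));
    }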
Paolo Abeni
Topic: Dealing with double/nested UDP tunnel
- Use case: containers hosted in VMs
- Correctly handled by the stack, but not good in terms of performance, e.g. no GSO, HW encryption, etc.
- GSO: add UDP_TUNNEL_* as GSO partial
- But then TX/RX asymmetry (missing GRO)
- Idea: replace multiple geneve encaps with a single geneve encap with options indicating the extra levels of encap (a rough sketch of such an option’s payload follows this list)
- Problems
- an RX node that doesn’t understand the new option will drop packets
- we need to detect that and disable use of the option
- with GSO: headers contained in the option (2nd layer encap and inner payload) will have invalid length/checksum
- need to set them correctly when doing the first decap (convert the option into the real headers)
- Eric: how are geneve options used today?
- passed to ovs userspace
- kernel currently doesn’t use them, except maybe with md_dst/tc (opaque value)
- Willem: why can’t GSO support multiple encaps?
- it needs to know the offsets of the inner headers
- Willem: a generic solution would have value for non-geneve multiple encaps
- Eric: GRO doesn’t need to store all its offsets in the CB, it could be a scratch area
- Willem: GSO could parse all the headers
- Paolo: no, because it needs to identify the socket, which could be in a different netns
- Jakub: do you need to reserve the option with IETF?
- Paolo: custom options are ok with geneve
- Sabrina: standardize that option + the detection mechanism together?
- Jakub: don’t put the inner encaps in the option, only the length of those encaps
- Paolo: problem on segmentation
- Jakub: use GSO partial on TX, and this on RX
- Paolo: has to be non-critical option (we currently drop all critical options since we don’t implement any)
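A conceptual sketch only, with made-up names and layout, of what the payload of such a non-critical Geneve TLV option might carry, following the suggestion to carry only the length of the elided inner encapsulation rather than the headers themselves:

    struct nested_encap_opt_data {     /* placeholder, not an allocated option */
            __be32 inner_vni;          /* VNI of the collapsed inner tunnel */
            __be16 inner_encap_len;    /* total length of the elided inner headers */
            __be16 reserved;
    };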
QUIC in kernel
- Lots of code, very complex
- QUIC was intended to be implemented in userspace so it’s easy to deploy new versions
- Handshake in userspace (net/handshake)
- For kernel consumers (NFS/SMB)
- Pluggable congestion control using TCP’s?
- No, it reimplements CUBIC in net/quic