Speakers
Jakub Kicinski
Matthieu Baerts
Willem de Bruijn
Kuniyuki Iwashima
Jason Xing
Eric Dumazet
Ido Schimmel
Sabrina Dubroca
Daniel Borkmann
John Fastabend
Alexander Lobakin
Paolo Abeni
QUIC in kernel
Jakub Kicinski
Notes: Paolo, Sabrina, Matt
Testing:
- New addition since last netconf
- tests run on VMs with 4 vCPUs
- Multiple VMs run in parallel, can be more than the actual number of CPUs on the physical system
- Postgres DB for the results + JSON cache (to help the web UI)
- adding more tests may cause performance issues
- gathers all patches in patchwork and runs everything every 3 hours
- individual patch testing: only build + static testing
- running tests in parallel could help
- except the time is dominated by a handful of very slow selftests
- What is enabled? all debug options (KASAN, KMEMLEAK, SLAB debug)
- very slow, some results are flaky
- might be enough to check for stacktraces, use the non-debug run to look at the selftests results
- passing on slow kernels could be useful as a proxy for slow machines (embedded)
- can be quite a lot of work to have that open
- Is there value? People who currently look at the results know the tests, and know when a failure is not a false positive.
- move it out of BPF?
- too hard to run these tests
- might be good to move them to drivers/net (or hw)
- they can still be executed by the BPF CI
- Maybe they don't have a lot of value?
- Bootlin is currently working on moving old tests to "prog_tests"
YNL:
- lack of packaging: someone working on it
- lack of bindings in other languages: only C and C++ exist, and nothing else is coming.
- lack of testing?
- using YAML for syzbot coverage: there was some interest but no current work
- might be interesting to have it integrated into iproute2, because it is packaged everywhere:
- might be difficult to link dynamically with the ynl library
Private drivers (we cannot buy the HW in store):
- they might bring new APIs that could be too specific, or maybe not → they can also bring motivation for new features (e.g. flowlabel)
Matthieu Baerts
Notes: Willem
Topic: Netdev CI and NIPA
Patchwork for maintainers, not developers:
- Don’t want people to start relying on the test infra for their correctness: test internally first
- Output can be unclear: not every failure is problematic. People might send too often, fixing unimportant things.
About sending emails: Question about BPF test infra: concerns about sending too many messages to the list. Unlike kbuild, these failures are not necessarily hard blockers.
Coverage: Are there subsystems that currently have little to no coverage? We don’t have GCC coverage instrumentation. For MPTCP, Paolo had a look: most of the regular (i.e., non-error) paths were covered.
Flaky tests: are being ignored. How to get them to pass? Jakub occasionally sends a summary to the mailing list.
Reproduce locally: can use virtme-ng? Or build docker image. But may need bleeding edge versions of userspace tools, e.g., iproute2. Devs mention this in the commit message.
Is it useful to make this generally available? Yes, for reproducing issues, for developing on NIPA, and for presubmit testing before even sending patches upstream.
Funding: Allow testing before submission, may also make it easier to get funding from more companies. Nvidia already uses Intel’s infra (?). Waits for nightly results before submitting upstream. What is the overhead of setting up something like the BPF foundation? A meeting every two weeks.
How to expand to other subsystems: reuse NIPA or build from scratch? MPTCP uses GitHub Actions, non-trivial to set up. Could we have something reusable, perhaps hosted by the Linux Foundation, like patchwork? What we want is a dashboard that shows KTAP: a matrix of tests * runs.
Stable versions: kselftests should support all previous versions. Linaro LKFT already runs kselftests from HEAD against stable kernels. Skip if a feature is not supported. Can be hard to do: example is packetdrill, where subtle changes deep in the TCP stack can affect behavior of many packetdrill tests. Alternative: send selftests that accompany bug fixes to stable. The tests are being accepted to stable too. So ask for an opt-out for this test-stable-from-HEAD policy for networking.
Test output: recommendation for new tests to generate KTAP. Can use ktap_helpers.sh for new shell-based tests. Or roll your own, the format is fairly straightforward: https://docs.kernel.org/dev-tools/ktap.html . But note that NIPA and LKFT parse TAP13, from which KTAP v1 slightly diverges.
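For reference, a minimal KTAP report (test names here are made up) looks like:

    KTAP version 1
    1..3
    ok 1 setup
    not ok 2 rx_checksum
    ok 3 hw_gro # SKIP device lacks HW GRO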
Code reuse: move more code into net/lib or even under kselftests, like MPTCP code. Unclear who maintains the code under kselftests.
Willem de Bruijn
Notes: Paolo, Matthieu
Topic: Testing and scaling
Reviews:
- Use b4 diff for reviews, could be useful to integrate in patchwork or send as an email. Show diff between patch set versions.
- Matttbe: patchew does something similar, pulling patches directly from the ML (patchew.org). Less powerful than b4: it does not apply patches, it just does patch diffs
- Integration with patchwork could be problematic due to pw maintainer “latency”
Precondition checks in kernel code:
- Some functions have non-trivial preconditions on their args/input (only non-GSO skbs in some code paths, etc.). We could add precondition debug checks in the form of DEBUG_NET_WARN*. Useful, or pointless noise? E.g. add "skb_assert_nofraglist" (a minimal sketch follows this list)
- Some fields are not always initialized (e.g. mac_offset), which is hard to check.
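A minimal sketch of such a precondition helper (the name comes from the discussion, the body is an assumption): DEBUG_NET_WARN_ON_ONCE() compiles away without CONFIG_DEBUG_NET, so the check only costs something on debug builds.

    #include <linux/skbuff.h>
    #include <net/net_debug.h>

    /* Warn (once, debug builds only) if an skb carrying a frag list reaches
     * a code path that assumes no frag list. */
    static inline void skb_assert_nofraglist(const struct sk_buff *skb)
    {
            DEBUG_NET_WARN_ON_ONCE(skb_has_frag_list(skb));
    }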
Expanding kselftests: benchmarks and fault-injection:
- No perf-related selftests in CI. Adding them is complex (platform/H/W specific, possibly a lot of flakes, can have a lot of noise). A problem is also how to express a perf failure in KTAP format.
- It would be very interesting to share such tests: everybody is doing this on their side, probably with the same tests, and it is hard for someone new to start a similar project
- Paolo: Automated perf testing inside Red Hat: some flakiness, CPU usage is also compared
- tied to HW testing, perf tests need real NICs
- It can be interesting to publish the reports to increase competition and avoid regressions: DPDK does something like that, so vendors want their fixes in before a release and collaborate with other vendors
- More fault-injection functions, to increase error-path coverage.
Scaling:
- Tools such as mpstat and ethtool -S were developed when machines had fewer CPUs and NIC queues; they don’t show/cope well with large H/W. The number of queues can differ from the number of cores, but the code sometimes/somewhere does not really expect that.
- Joe: TLB flush/madvise is often a bottleneck for user-space
- Per device and per-cpu variables could use/waste a lot of memory
- We need good default configuration for rx/tx queue numbers, pinning.
- Daniel: tuned could be a starting point, but it currently does not cope with queues and IRQs
- Willem: google uses python script to generate shell boot script tailored to each platform, derived from platform-independent heuristics
- Dynamic queue allocation, useful in the container use-case. Containers care about queues (and RSS contexts) more than devices. How to expose this API to user-space?
- Dedicated polling cores
- Device independent config state
- NAPI persistent ID, i.e. idpf
A lot of work in the shared memory area.
Kuniyuki Iwashima
Notes: Sabrina, Paolo
Topic: Per Netns RTNL
- some dumpit functions have been converted to lockless access
- some doit functions converted to lockless are also "get" functions
- creating lots of netns + some netdevices in each ns is very slow
- Solution: a per netns lock, but replacing all the rtnl_*lock calls is painful
- start by nesting the per-net lock under rtnl_lock, then remove the old lock when conversion is complete
- some doit handlers operate on multiple netns at once
- lock order helper to lock the pair of netns: init_net first, then compare addresses
- Add helpers to simplify locking multiple netns respecting that order (see the sketch at the end of this section)
- Eric: make rtnl_net_lock take rtnl_lock during the conversion phase, to avoid having to rename then remove rtnl_lock to _deprecated:
- In some places, it is needed to have some code between the _deprecated and the net one
- But not a blocking issue: we can have a special __rtnl_lock… that could cover these special cases
- Note: it looks like it is not needed to introduce the _deprecated name:
- The current rtnl_lock is the deprecated one
- If the function name is changed but the code behind it is the same, that will make backports to stable versions harder to do
- We need to ensure the ops will stay alive around the netns lock/unlock; a get helper acquires the netns refcount and adds the lock to the list
- Eric: rtnl_link_ops conversion: use SRCU?
- veth can have a peer, and macvlan (and other) have a child device. peer/child can be in a different netns.
- Jakub: the peer can be read from the core (there’s an ndo)
- macvlan: upper devices/ports can be in different netns
- unregister per netns: add devices to per-net list, then per-net unreg work
- Open questions on setlink/delink
- Jakub: locking code vs locking specific data structures
- Paolo: per-netdevice lock (initially small, but could cover more properties)
- Eric: won't help with spawning 10k netns
- Eric: how would for_each_net() work with per-netns RTNL? and notifiers?
- Eric: what about netns deletion?
- Kuniyuki: not the focus for now
- Eric: but it's a pain point for google
- cleanup_net vs module load, potential deadlocks
- Eric: large set of changes, and during the conversion phase everything will become slower
- or per-netns lock can be a noop until rtnl_lock gets removed
- DEBUG kernels have the lock so we can use LOCKDEP to detect lock ordering issues
- Jakub: can we pull some operations out of rtnl_lock?
- Eric: large pcpu allocs (SNMP) during netns initialization are slow
- Eric: prefer mutex to spinlock in any context we can sleep (even for small sections)
- and there are lockless lists as well
- Eric: pcpu allocator uses a spinlock
- Paolo: sysfs interfaces use rtnl_trylock, which makes the contention worse
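A minimal sketch of the pair-locking order discussed above, assuming a per-netns rtnl_net_lock() as proposed (the pair helper below and its exact form are assumptions): lock init_net first, otherwise order by pointer address, so two tasks locking the same pair of netns cannot deadlock.

    #include <linux/minmax.h>
    #include <net/net_namespace.h>

    static void rtnl_net_lock_pair(struct net *net_a, struct net *net_b)
    {
            /* init_net always goes first, otherwise lowest address first */
            if (net_b == &init_net || (net_a != &init_net && net_b < net_a))
                    swap(net_a, net_b);

            rtnl_net_lock(net_a);           /* assumed per-netns RTNL lock */
            if (net_a != net_b)
                    rtnl_net_lock(net_b);
    }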
Jason Xing
Topic: Extending SO_TIMESTAMPING feature
- useful to detect latency
- Some measurable overhead, and possibly changes required to existing applications
- Use a bpf program to do the required setsockopt without app changes (the existing per-socket API is sketched after this list)
- Use kprobes to fetch the timestamp without additional syscalls
- Since no recvmsg() consumes the timestamps, the modification will hit the sk_rmem_alloc limit at some point
- Possible alternative: use tracepoints
- Quite noisy, will trace every application
- A new setsockopt could be used to set a per socket flag to enable the tracepoints
- It’s tricky to look at the skb payload, if needed
- Can’t enable both per-application timestamping (via the existing API) and this ‘admin’ timestamping simultaneously.
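For context, a minimal sketch of the existing per-socket API that the BPF-based proposal would enable on the application’s behalf (the flag combination is just an example):

    #include <linux/net_tstamp.h>
    #include <sys/socket.h>

    static int enable_tx_timestamping(int fd)
    {
            unsigned int flags = SOF_TIMESTAMPING_TX_SCHED |
                                 SOF_TIMESTAMPING_TX_SOFTWARE |
                                 SOF_TIMESTAMPING_SOFTWARE |
                                 SOF_TIMESTAMPING_OPT_ID |
                                 SOF_TIMESTAMPING_OPT_TSONLY;

            /* Timestamps are read back from the socket error queue via
             * recvmsg(fd, ..., MSG_ERRQUEUE) as SCM_TIMESTAMPING cmsgs,
             * which is the extra syscall the proposal wants to avoid. */
            return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING,
                              &flags, sizeof(flags));
    }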
Willem/John: use a bpf cgroup hook to attach a bpf program to collect aggregate stats/histograms (we do not want to push the full tx stream to userspace).
The Google use case is to track (by sampling) the lifecycle, latency, etc. of RPCs across different hosts
Jakub: we need an end-to-end implementation, from packet transmission to actual timestamp collection and usage.
Eric: the timestamping APIs are inherently racy because it touches socket fields and can’t acquire the socket lock
Eric Dumazet
Topic: UDP, TCP
UDP
- ISC: security flaw: spoofed source address equal to the host's own address
- neigh code drops the packet
- server gets blocked when the queue gets full
- router should be blocking those incoming packets with RPF
TCP socket lock
- regression in tcp_v6_rcv: spending more time acquiring the lock
- TX takes the spinlock only briefly to set owned_by_user; it is RX and the timers that really take the spinlock
- timers don't get canceled, they run and the handler acquires the lock to check if there's anything to do
- things done under lock (needlessly, or much longer than needed)
- waking up a process (which is expensive)
- handling ACKs (kfree_skb causes contention on kmem_cache)
- some state shouldn't be in the socket (and protected by the socket lock)
- maybe percpu (rt guys won't like this) or per-thread data
- delay wakeups and kfree_skb until after unlock (see the sketch after this list)
- prepare skb reply, and after spin_unlock send it
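A rough sketch of the deferral pattern above (function and variable names are illustrative, not the actual TCP code): collect skbs while holding the socket spinlock, then free them, and do any wakeups, only after the lock is released.

    #include <linux/skbuff.h>
    #include <net/sock.h>

    static void example_ack_cleanup(struct sock *sk, struct sk_buff *acked)
    {
            struct sk_buff_head to_free;

            __skb_queue_head_init(&to_free);

            bh_lock_sock(sk);
            /* ... ACK processing: instead of kfree_skb() under the lock,
             * move the acked skbs onto a local list ... */
            __skb_queue_tail(&to_free, acked);
            bh_unlock_sock(sk);

            /* kmem_cache contention (and wakeups) now happen unlocked */
            __skb_queue_purge(&to_free);
    }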
BIG TCP
- Forward packets from host to VMs that may not support BIG TCP
- Add a flag to virtio to negotiate
- the virtio-net header has 16-bit length fields for each packet, so this would require a spec change (the current header is shown below)
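For reference, the current header from include/uapi/linux/virtio_net.h; the per-packet size fields are all 16-bit, which is where the limitation mentioned above lives:

    struct virtio_net_hdr {
            __u8 flags;
            __u8 gso_type;
            __virtio16 hdr_len;     /* Ethernet + IP + tcp/udp hdrs */
            __virtio16 gso_size;    /* Bytes to append to hdr_len per frame */
            __virtio16 csum_start;
            __virtio16 csum_offset;
    };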
syzbot
- not a lot of dupes anymore, mostly good reports, very useful
- Eric triages so only networking bugs get reported to netdev
Jumbo and beyond?
- 4k was good for zerocopy
- ACKs don’t get sent immediately if packets are smaller than the MTU; some workloads don’t like larger MTUs
Ido Schimmel
Topic: DSCP matching, IPMR extension, XDP metadata for telemetry
Multipath hash seed:
- Depending on the use case, the admin may want the multipath hash to be different or equal on different hosts: hence the need to set the hash seed from user-space
Transceiver module firmware update:
DSCP matching:
- TOS is a minefield, with overlapping and contradictory bit allocations over the years, and the kernel still uses very old macros to touch it.
- Needed to refactor the masking in the core and expose a DSCP selector to the FIB engine
XDP metadata for telemetry:
- psample can sample packets at the tc level, passing some additional metadata to user-space via netlink
- XDP could speed up the process, but will need some additional metadata
Drop reasons
- generic (reusable) vs specific (easier to understand the problem); see the annotation example after this list
- Paolo: too specific is not useful, since we already know the callsite
- we have location + stack trace
- Eric: check the compile time if we add lots of drop reasons, since that file is included everywhere
- do we need to annotate everything, or leave some unspecified?
- renames will break users, but they’re not considered uapi
- Jakub: renumbering is fine. rename/remove: try and see if people complain
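For context, a small made-up example of the existing annotation API: the reason enum shows up in the kfree_skb tracepoint alongside the callsite/stack trace.

    #include <linux/skbuff.h>
    #include <linux/udp.h>

    static void example_drop(struct sk_buff *skb)
    {
            if (!pskb_may_pull(skb, sizeof(struct udphdr))) {
                    /* specific reason instead of a bare kfree_skb() */
                    kfree_skb_reason(skb, SKB_DROP_REASON_PKT_TOO_SMALL);
                    return;
            }
            consume_skb(skb);
    }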
Sabrina Dubroca
Topic: macsec offload, semi-automated validation for uapi patches
macsec offload:
- The core needs to know the packet number (sequence chosen by the HW) to do the rekeying; there is no generic way for the driver to tell it to the core
- Some drivers don’t update this packet number
- wpa_supplicant does the rekey based on the interface stats
- Eric: use counters (from the stats) / estimations if the driver doesn’t provide anything
- Jakub: write a test for HW: new implementations have to pass that test
- Two ways for the offload: PHY and MAC:
- Userspace has to choose one, no (current) way to know what’s available
- ANY: kernel chooses / tries: but that will only work for newer kernels
- Or the userspace does that because they have to do that with older kernels anyway: try one, then the other one
- Old drivers do not provide metadata: can’t tell the core if the rx packet is eth or h/w decrypted macsec, nor the macsec channels
- The core workaround broke isolation for bcast packets
- Require new drivers to support the metadata, e.g. via a selftest that has to pass
- What about old drivers that don’t/cannot support that?
- Eric: disable this offload feature on old drivers, but have a way (static key) for the user to enable it: a way to warn the user it is not secure
- Paolo: probably best to talk to the vendors about that, maybe they can fix the drivers, if it is possible with the HW
semi-automated validation for UAPI patches:
- It is “easy” to break the UAPI and miss that in review, e.g. new entries in the middle of a struct, changing the size of a field, etc. (see the illustration after this list)
- Tools can help detect the problems by looking for struct layout or define changes; the detection heuristics can be prone to false positives or negatives, so we need to draw a line on what is acceptable
- Might be enough to generate a pahole diff and add a link on Patchwork
- + a warning (similar to checkpatch, e.g. when adding a new file) when the UAPI is modified, and reviewers can check that.
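An illustration (made-up struct) of the kind of silent break such tooling should flag: inserting a field in the middle of a UAPI struct shifts the offset of every later field, while old binaries keep using the old layout; a pahole-style diff makes this visible during review.

    struct foo_uapi {
            __u32 flags;
            __u32 new_field;  /* BAD: inserted mid-struct, moves 'value' from offset 4 to 8 */
            __u32 value;      /* old userspace still reads this at offset 4 */
    };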
Daniel Borkmann
Topic: Overlapping TProxy with non-local bind + Wireguard + BIG TCP
Overlapping TProxy with non-local bind:
- The proposal looks good to Eric and Kuniyuki
Benchmarking over the wire for wireguard: the CoW of the data on decrypt looks very expensive in the flamegraph:
- especially visible on AMD cpus?
- skb_cow_data() is quite heavy, but mostly because of the clear_page_rep() being done:
- Eric: why is this page zeroed? Maybe a security feature forcing that? Check the config option (INIT_ON_ALLOC_DEFAULT_ON)
Big TCP:
- Eric: best to have a fast path: only do the v4 check if the previous one failed
- Eric: if the cacheline is an issue, the field can also be moved (some other fields will be moved later on as well)
John Fastabend
Topic: DPU use cases for Cilium, proposed programming API
DPUs have a bunch of general purpose CPUs used to offer different services (ipsec, tls, ...)
Exposed in vendor-specific ways (SDKs)
Frequent use-case: a custom header above UDP with a sequence number and a token. Such a proto/header is not going to be added to the kernel; the need is to filter / do admission control based on custom header fields.
The usual workflow is: write the parser in a DSL, configure the device, compile the parser for the (device-specific) backend, upload and reboot.
Want to avoid another SW layer in the kernel to emulate the HW.
Want to build:
- Load balancers (data traffic does not reach the host)
- Firewalls
- RSS
- Crypto services
The problem is how to express TCAM entry manipulation, e.g. P4, OVS, or something similar
Willem: what about updates: is the interface going to be a bottleneck? Perhaps a direct mapping of the CAM could/should be needed?
Willem: the DPU could be modeled as a switch with multiple ports (host side/wire side)
You can use the free cores on DPUs for more funky stuff like TCP reordering, but we don’t need kernel interfaces for that kind of thing - it’s like s/w running on a different host.
Alexander Lobakin
Topic: netdevice related
- netdev_from_priv() to reduce overhead and boilerplate in drivers:
- Drop the back-pointer from the priv data to the net_device and use a generic helper instead, as the core struct is always at a fixed offset WRT the private data (a minimal sketch is at the end of this section).
- Will save a little data and code per driver.
- Seems fair
- synchronizing ETH_SS_STATS with changing the number of queues (via timers):
- Per-queue stats require 3 calls, and the number of queues has to stay consistent in between, or bad things will happen
- But we may move away from ethtool stats per queue, so this problem should go away for new drivers
- Perhaps add some documentation for the above
- If we move some netdev flags to priv, we will need inheriting and incrementing private flags (just like we do for the features), candidates:
- GRO in cpu maps; GRO and NAPI do not necessarily go together, the idea would be to decouple the GRO helpers from NAPI so that any entity other than NAPI could do GRO too, e.g. CPU maps, and perhaps later GRO cells
- Eric: moving the GRO bits out of napi could change the napi struct layout, must be careful to avoid breaking cacheline-based optimizations
- HSR consumes 4 bits, very little usage for that
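A minimal sketch of the proposed helper (the name comes from the discussion, the body is an assumption): it simply inverts netdev_priv(), which places the driver private area at a fixed, aligned offset after struct net_device.

    #include <linux/netdevice.h>

    static inline struct net_device *netdev_from_priv(void *priv)
    {
            /* mirror of netdev_priv(): the private area sits right after the
             * aligned struct net_device, so the reverse mapping is a constant */
            return (struct net_device *)((char *)priv -
                                         ALIGN(sizeof(struct net_device),
                                               NETDEV_ALIGN));
    }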
Paolo Abeni
Topic: Dealing with double/nested UDP tunnel
- Use case: containers hosted in VMs
- Correctly handled by the stack, but not good in terms of performance, e.g. no GSO, HW encryption, etc.
- GSO: add UDP_TUNNEL_* as GSO partial
- But then TX/RX asymmetry (missing GRO)
- Idea: replace multiple geneve encaps with a single geneve encap with options indicating the extra levels of encap (a rough sketch of such an option’s payload follows this list)
- Problems
- an RX node that doesn’t understand the new option will drop packets
- we need to detect that and disable use of the option
- with GSO: headers contained in the option (2nd layer encap and inner payload) will have invalid length/checksum
- need to set them correctly when doing the first decap (convert the option into the real headers)
- Eric: how are geneve options used today?
- passed to ovs userspace
- kernel currently doesn’t use them, except maybe with md_dst/tc (opaque value)
- Willem: why can’t GSO support multiple encaps?
- it needs to know the offsets of the inner headers
- Willem: a generic solution would have value for non-geneve multiple encaps
- Eric: GRO doesn’t need to store all its offsets in the CB, it could be a scratch area
- Willem: GSO could parse all the headers
- Paolo: no, because it needs to identify the socket, which could be in a different netns
- Jakub: do you need to reserve the option with IETF?
- Paolo: custom options are ok with geneve
- Sabrina: standardize that option + the detection mechanism together?
- Jakub: don’t put the inner encaps in the option, only the length of those encaps
- Paolo: problem on segmentation
- Jakub: use GSO partial on TX, and this on RX
- Paolo: has to be non-critical option (we currently drop all critical options since we don’t implement any)
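A conceptual sketch only, with made-up names and layout, of what the payload of such a non-critical Geneve TLV option might carry, following the suggestion to carry only the length of the elided inner encapsulation rather than the headers themselves:

    struct nested_encap_opt_data {     /* placeholder, not an allocated option */
            __be32 inner_vni;          /* VNI of the collapsed inner tunnel */
            __be16 inner_encap_len;    /* total length of the elided inner headers */
            __be16 reserved;
    };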
QUIC in kernel
- Lots of code, very complex
- QUIC was intended to be implemented in userspace so it’s easy to deploy new versions
- Handshake in userspace (net/handshake)
- For kernel consumers (NFS/SMB)
- Pluggable congestion control using TCP’s?
- No, it reimplements CUBIC in net/quic