NetConf 2023
Day 1 - 25th September 2023
Alexei - BPF and Networking Conferences
- No slides
- It is nice to have NetConf back after a 3 year hiatus!
- LPC and Netdev are bigger events, but not coordinated with each other
- Willem:
- LPC, NetConf and Netdev.conf are too close to each other
- 2 conferences a year are fine, for each of BPF and Networking
- But they should be spread out
- Uneasy about merging NetConf and Netdev.conf
- Doesn’t want to lead liaison, but happy for someone else to step up
- Happy to have NetConf colocated with Kernel Recipes
- Netdev typically piggy-backs on a larger conference (restricts timing options)
- LPC has the value of other tracks
- LPC is a large conference. 500 attendees. Always sells out
- Will reach out to Jamal (Netdev)
- Proximity to LPC is a key issue, but there are many factors to consider
- LPC track is really BPF
- Should there be a separate Networking track?
- Netdev.conf is useful regardless, it tries to reach out to universities etc. Has a broader spectrum of attendees.
- Plumbers was supposed to be about discussion rather than lectures
- But perhaps it is too big for discussion
- Very few people are volunteering talks; something is wrong
- Need to tweak the format, but how?
- David M, Jiri, Alexei, Jakub
- Some discussion of merging Netdev into LPC
- Issue may be that existing Netdev organisers would lose control
- Came back to the key issue: proximity of LPC and Netdev
- Some discussion of attaching NetConf to LSF/MM
- Some discussion of back-to-back with Kernel Recipes instead of overlapping
- People want to attend Kernel Recipes
- But there could be extra overhead in organisation
- And 5 consecutive days “wipes people out”
- Conclusion (Daniel/Jakub):
- Having two networking conferences a year would be preferred and venues could be netdev.conf (spring) and LPC (autumn). Needs coordination with Jamal.
- Need to ensure it doesn’t overlap with LSF/MM/BPF, which is also in spring
- For future LPC, it would be good to have BPF and Network track as two separate entities rather than merged.
- Perhaps LPC could have submission type talk vs discussion
- Action: try to propose moving Netdev.conf out to spring / early summer to space the conferences out [David Ahern]
Alexei - Tree Management
- No Slides
- Most patches go through sub trees, get some soaking before hitting net-next
- But some patches go directly into net-next, which can lead to breakage, which affects others
- Proposes that net-next only be an integration tree
- Saeed, proposes new integration branch
- Florian W, proposes an acceptance branch
- Netfilter has done this with ‘testing’ branch
- Bots and so on often catch problems there
- Jakub, key is to have CI, and enforce its use (reject patches if needed). If we don’t have CI, tree shuffling will not help. If we do, we can run it without changing trees/process.
- Paolo, the problem with CI is that we don’t have coverage, and building it will take a long time
- Jakub, if we have coverage we can just run it on linux-next…
- Paolo, it is a big if. Key is to increase coverage
- Jakub: need to ensure that tests are running. Need infrastructure.
- A lot of discussion of the difficulties of making CI available, reliable…
- Alexei: it is a full time job for a team at Meta to run BPF CI
- Saeed, also have a large CI team at Nvidia
- Action: build automation to collect pending patches from the list and put them on a branch so people can run their CIs on it [Jakub]
Toke - XDP: Past, Present and Future
- Has slides
- Summarised features since last NetConf (slide)
- Map lookup, hashmap type, bpf_redirect performance
- Jesper would like to see users before optimising
- Need_wakeup mode
- User-space app running AF_XDP on the same CPU as network processing
- Avoids syscalls as much as possible
- Summarised ongoing work (slide)
- Jesper: Aim to avoid naive implementations that hurt performance
- Toke: turns out that people like containers
- Then they want SR-IOV
- Then they want policy
- Possible to apply policy in the netns of the physical device
- Then run AF_XDP in tenant container
- Performance is ~8Mpps
- AF_XDP for virtio-net
- RFC to redirect frames to crypto engine
- XDP driver support (slide)
- 39 drivers support XDP
- 0 drivers enable all features by default. Mlx5 is probably closest
- Toggle would be a good step
- Which could be adjusted when XDP programs are loaded
- Common defaults too
- Have had some success in folding features into “basic”
- ⚑ Simon will check if nfp has redirect support
- David A asks how many people know about / use monitor
- Alexei proposes a strawman: remove drivers that don’t support XDP. His real point was to deprecate unused drivers.
- Some discussion of why WiFi doesn’t support XDP
- Currently used for high performance data centre use cases
- Would like to see it move to generic tool for fastpath
- Currently proprietary hw offloads are used instead
- A migration path from DPDK is to use DPDK AF_XDP driver
- Key issue with DPDK is deployment model
- Address problems in non-zero copy mode; people care about this
- Willem: some (Intel) drivers require XDP program to be loaded to use AF_XDP, but this doesn’t make sense
- Jesper: can be addressed now that we have features
- The term metadata is overloaded/meaningless
- Saving metadata would be opt-in
- Willem, GRO is a great case for XDP
- It is one of the few packet parsers in the kernel
- But moving it further down the stack seems hard
- Toke, need a place to buffer XDP frames
- Would enable many interesting things
- Queueing xmit side (slide)
- Eric feels that kthread option might make sense
Jesper - Page pool
- Has slides
- Recycling is now everywhere
- Have fragmenting pages. Moves away from one page per packet
- Recent proposal: hide pp_frag_count
- Page pool no longer handles pages?
- Motivation is for Arm64 with 16K pages, one page per packet is wasteful
- Maybe rename to netmem or something else
- Have got good performance by addressing specific cases
- Generalising may partially reverse this
- Suggests adding more memory allocator types
- Discussed returning non-pages
- Currently overloading bit in skb
- Store in frags with page pointer with magic bits set
- The API is growing, makes it difficult to provide alternate implementation / replacement that’s not page-backed
- Asks if out-of-tree benchmarks can be easily changed to use multiple buffers per page
- Jesper: can be easily done
- David: would offer some insights regarding usefulness of such an approach
Alexander - GRO Overhead
- No slides
- For each packet we create an skb, feed it into GRO, then just store the head and pass it to the NAPI cache
- May be better to pass XDP data to GRO; quickly pass frame to GRO layer
- Willem, we already have NAPI GRO by frags
- Eric, but still need to allocate/initialise the skb
- David A, would be a trade off, as you’d need to iterate over a NAPI-budget worth of frames, as they are related to each other
- Jesper, only need a small amount of information from hw to feed into GRO.
- Implementation is a challenge
- Also agrees with Eric that benchmarking is required
- Two options
- XDP aware GRO
- GRO implemented (partially) in XDP
- Jesper, believes GRO is overhead, ~15%
- Alexander. Most drivers pass packet type (TCP, …) in the descriptor. Could pass this into stack
- Quite some discussion about HW GRO, hints, XDP GRO, …
- Willem describes that the state needed to feed hints into sw GRO is too large
- Eric describes scheme used by Google
- HW provides 16bits metadata: start (1bit); end (1bit); flowid (14bits)
- Driver assembles skb based on this information, sw GRO is mostly unused (see the sketch below)
- Jakub, this is really just a GRO-HW implementation if HW does full matching
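- The exact descriptor format wasn’t given; below is a small illustrative sketch of unpacking such a 16-bit hint (the bit positions are assumptions, not the real layout).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical layout of the 16-bit per-descriptor metadata described above:
 * bit 15 = start of flow, bit 14 = end of flow, bits 13..0 = flow id.
 * Bit positions are assumptions, for illustration only.
 */
struct gro_hint {
	uint8_t  start;
	uint8_t  end;
	uint16_t flow_id;
};

static struct gro_hint decode_gro_hint(uint16_t md)
{
	struct gro_hint h = {
		.start   = (md >> 15) & 0x1,
		.end     = (md >> 14) & 0x1,
		.flow_id = md & 0x3fff,
	};
	return h;
}

int main(void)
{
	uint16_t samples[] = { 0x8005, 0x0005, 0x4005 }; /* start, middle, end of flow 5 */

	for (size_t i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		struct gro_hint h = decode_gro_hint(samples[i]);
		printf("md=0x%04x start=%u end=%u flow=%u\n",
		       samples[i], h.start, h.end, h.flow_id);
	}
	return 0;
}
```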
- Io_uring has locking scheme to allow use of user pages by kernel and handling the case where the user space process terminates
- Christoph Hellwig (not present) has complained that page pool should use DMA APIs to avoid starving DMA
- Eric, if a hugepage is used to store many frags, then all frags share a single lock
Jesper - ksoftirq
- Has slides
- “softirq: Let Ksoftirqd do its job” was reverted in May 2023
- Original patch solved UDP overload case
- But other cases became apparent over time; revert was probably the best option
- Cloudflare see less time in softirq with patch, but innocent userspace threads suffer
- An option is to use a different API
- Multi-message UDP receive (recvmmsg) doesn’t help; it just loops over each message internally
- UDP GRO (pending testing)
- Io_uring UDP doesn’t help
- Eric, batching UDP across user-space/kernel will shift the problem but not solve it
- Changelog of revert describes how to resolve problem with modern kernels
- Willem, affects any workload where a significant amount of traffic doesn’t have flow control
- Jakub, in that case busy-poll is a solution (see the sketch below)
- Jesper, has seen this used in conjunction with AF_XDP
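- For reference, a minimal sketch of enabling per-socket busy polling via SO_BUSY_POLL (the fallback #define assumes the common asm-generic value; older kernels may require CAP_NET_ADMIN):

```c
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

#ifndef SO_BUSY_POLL
#define SO_BUSY_POLL 46	/* asm-generic value; fallback if libc doesn't expose it */
#endif

int main(void)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0) {
		perror("socket");
		return 1;
	}

	/* Busy-poll the device queue for up to 50us before sleeping in recv */
	int usecs = 50;
	if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usecs, sizeof(usecs)) < 0)
		perror("setsockopt(SO_BUSY_POLL)");	/* may need CAP_NET_ADMIN on older kernels */

	close(fd);
	return 0;
}
```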
- Eric, the original change was all about latency (it made latency worse).
- Need to measure latency
- Question about using RPS with threaded NAPI. Currently doesn't work, would require the threaded backlog patches proposed by Sebastian.
Paolo - Inclusive Language and more
- Deprecate problematic API
- Add new enums/defines with inclusive names, keeping the same values (see the sketch below)
- Sabrina, Eric have concerns about backporting
- Saeed asks about deprecating old code
- Paolo, initially just add duplicate uAPI
- Later reduce usage of non-inclusive language inside the kernel
- Daniel asks if there is a kernel-wide policy; how other subsystems handle this, maybe by LF TAB?
- Florian F and others. SPI is an example, a rename has occurred there
- Jakub asks if the size of the problem has been quantified
- Paolo. For uAPI there is bonding, bridge and team. But inside kernel usage is more widespread
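- A hypothetical sketch of the additive first step: introduce inclusive aliases with identical values so existing users keep working (names here are illustrative, not the actual bonding/bridge/team uAPI):

```c
#include <stdio.h>

/* Hypothetical uAPI fragment: old names stay for compatibility,
 * new names are just another spelling of the same values.
 */
enum foo_port_role {
	FOO_PORT_ROLE_MASTER	= 0,
	FOO_PORT_ROLE_SLAVE	= 1,
	/* inclusive aliases, same binary values */
	FOO_PORT_ROLE_LEADER	= FOO_PORT_ROLE_MASTER,
	FOO_PORT_ROLE_FOLLOWER	= FOO_PORT_ROLE_SLAVE,
};

int main(void)
{
	/* Old and new spellings are interchangeable on the wire and in code */
	printf("leader==master: %d, follower==slave: %d\n",
	       FOO_PORT_ROLE_LEADER == FOO_PORT_ROLE_MASTER,
	       FOO_PORT_ROLE_FOLLOWER == FOO_PORT_ROLE_SLAVE);
	return 0;
}
```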
- Packet header parsing
- Code is scattered
- Eric, need to prefetch the next packet. Current packet is not enough
- GRO is expensive, because it is the first layer to touch data, which is also touched by other layers
- Paolo, an option is to make monolithic GRO for TCP (and maybe UDP) with fallback to current model
- No data to back this idea at this time
- Jesper, could instrument with perf stat
- Eric, at high packet rates all the cache lines should be hot
- Paolo, another idea is to use a single structure for L3 and L4 callback
- Eric feels it would not make a great deal of difference, given the cache size of modern CPUs
- Plenty of other places to reduce cache line usage
- Syscall using a socket id.
- Struct file is 256 bytes
- Operation requires 4 cache lines
- Long discussion of reordering structures to optimise for cache lines, and avoiding regressions
- Willem raised the issue of dcache vs icache pressure. Adding extra code to avoid a cache-line miss may not be a win
- Jesper suggests using perf to instrument dcache and icache usage
- Current default settings are problematic
- Allocating more memory helps fragmentation, but causes OOM with many netns
- Eric suggests using memory cgroups. But a job with excess fragments will OOM
- Feels there is no good default
- Checksum in addition to the TCP checksum (at the MPTCP datablock level)
- Option is currently disabled by default
- Would like hw offload support
- Saeed feels that TX is doable, but not RX
- Is MPTCP blocking some TCP-specific optimisation?
- Willem comments that changes to, f.e., TCP timestamps also needed MPTCP updates
- Not obvious that TCP and MPTCP are in different directories
- And testing is unlikely
- Issue is that there are two implementations of the same interface
- Paolo, MPTCP has issues such as sub-flows to deal with. This has led to similar implementations of (mostly) the same interfaces
- Willem, need to figure out how to get more testing, and centralised location for tests
- Paolo, Have tried to implement test coverage, including packet drill
- General question: which subsystems should use PRs?
- Jakub indicates no preference: whatever is easier for subsystems
- Patches need to be posted to netdev at least once
- Motivation is that MPTCP maintainer(s) (not present) would find this easier
- Several members of the audience were dubious about that, but there were no objections
- How to evaluate?
- Alexei advises that it can be done with some manual work
- Florian points to existing tooling (lcov?)
Jiri - Devlink
- Used for orchestration, configuration; multiple devices
- Idea was to create entity whose scope was different from netdevs; suitable for configuring device
- Multiple types of devlink params: runtime, permanent, …
- Permanent is intended to (partially) replace vendor tools that update device
- Vendor-specific side channels persist
- Would like to open discussion on extending API to allow configuration of objects inside device
- Device has many objects: eswitch, VF, SF, …
- Jakub asks for more concrete requirements
- Some lively discussion about vendor vs vendor specific interfaces
- Jakub, raises issue of rate limiting for RSS context
- Saeed suggests TC+HTB
- RSS contexts do not have representors and the main netdev has FQ not HTB configured
- Simon raises the issue of modelling device
- F.e. queues for VFs. Asks if this is for devlink
- Jiri feels it is not
- Then briefly discussed modelling devices in general
- This will be the topic of Simon’s presentation tomorrow
- Jiri asks how to receive a subset of iproute monitor messages
- Suggests BPF
- David A, BPF on netlink doesn’t work well; it is too complicated; perhaps because message has to be formed before it can be filtered, which is too late, may as well drop in user-space
- Florian W suggests kfunc hooks
- Toke suggests subscription mechanism
- Discussion was inconclusive
- Jakub asks about two PFs on single device, as per recent patches from Corigine
- Currently separate devlink objects
- Possibility is to denote that one devlink instance is a peer of another
- Jiri will look into this
- Simon agrees that it should solve the problem
Florian F - mDNS wakeup and offload
- Have slides
- Set top boxes connected via Ethernet or WiFi
- Need to achieve Network standby due to EU CoC or Energy Star
- Wake-up on mDNS supported by some Broadcom drivers
- Hardware does not support matching only specific mDNS service; fallback to low-power CPU
- No upstream API to define strings to match on; typically hardcoded in firmware
- mDNS offload
- Querying for casting device is not enough
- Device will query when joining network even if not streaming video
- Streaming is ultimate identifier of intent; match on SYN/ACK
- Need to have db of mDNS records in fw
- No API for this (Android is expected to have one but it’s not public)
- Can extend ethtool_rx_flow_spec?
- Expect < 5 matches; at worst < 10
- There was no objection to expanding ethtool with configuring an explicit mDNS offload DB
Vladimir Oltean
- IEEE 802.1CB - Frame Replication and Elimination for Reliability
- Purpose is to ensure there is redundancy in the network; zero failover time
- Packets are sent on multiple paths, as close to the sender as possible
- Alternatives are HSR/PRP, which are for use on rings
- Redundancy tag is an L2 header
- Paolo suggests that it can be modelled using a networking device
- Issue is determining the stream that packets belong to, would need to replicate large portions of the tc classifiers
- Idea from Vladimir is one sw netdev per stream
- Another idea from Jiri is to put the devices in the block, use the indev for the classifier to differentiate, and forward to a dummy (or veth) device that a socket can be bound to
- Jakub suggest that shared actions could be used
- User applications could open AF_PACKET sockets and dynamically create tc filters and actions, which may also enable the hanic use case (see the sketch below)
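- A minimal sketch of the userspace side of that idea: an AF_PACKET socket bound to one interface (the tc filter/action setup would happen separately; the interface name is a placeholder):

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <net/if.h>
#include <arpa/inet.h>

int main(void)
{
	/* Requires CAP_NET_RAW; "eth0" is a placeholder interface name */
	int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
	if (fd < 0) {
		perror("socket(AF_PACKET)");
		return 1;
	}

	struct sockaddr_ll sll = {
		.sll_family   = AF_PACKET,
		.sll_protocol = htons(ETH_P_ALL),
		.sll_ifindex  = if_nametoindex("eth0"),
	};
	if (!sll.sll_ifindex || bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
		perror("bind");
		close(fd);
		return 1;
	}

	char buf[2048];
	ssize_t n = recv(fd, buf, sizeof(buf), 0);	/* one frame from the bound device */
	printf("received %zd bytes\n", n);

	close(fd);
	return 0;
}
```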
- Ethernet over backplane links
- 802.3 Clause 73 auto-negotiation used on backplanes and SFP28 modules
- Requires special auto-negotiation and link training; on NXP hw this is assisted by software; one RFC sent using phylib and one using phylink_pcs.
- Asks which phy-mode is used to describe MAC link to backplane internal PHY. With phylink_pcs, the AN/LT block is not a PHY, so perhaps “internal” is not fine.
- Is there a way to detect the media type of a link? Need to know whether to advertise the backplane modes (KR) or the SFP28 technology ability modes (CR) in the C73 base page
- Florian F asks how widespread it is
- Jakub answers that there is currently no way
- Jiri suggests device tree
- What does phylink do with SFP28 modules? Does it expect C73 autoneg?
- Florian F suggests trying some different SFP28 modules to find out if they expect C73
Day 2 - 26th September 2023
Jakub
- Stats, pw-bot, check stats, …
- Some discussion of pw CI
- No correlation between pass/warn/fail and acceptance of patches
- Need to work on noise in results. Patches welcome!
- There is a list of people who are not to be contacted. Patches welcome!
- Will work on ageing: f.e. no contact if a file has been untouched by an individual for 3 years [Jakub]
- Can’t test every patch, there are too many
- Automatically select patches from patchwork, create testing branch, build image with latest tools
- Spawn VM for non-hw tests
- For vendor tests, make images available for vendors to download. And provide mechanism for vendors to upload results
- Florian F suggests looking at kernel-ci, which aggregates results from many sources
- Jakub says the UI wasn’t great; and site seemed to often be down
- No relevant Networking vendor seems to be working with kernel ci at this time
- No vendors present were confident about being able to do so
- Goals: simplify config for drivers, create queues without device reset
- {Memory, Queue}{alloc, free}, Queue {start, stop, restart}
- David A: IB verbs equivalent is QP: handled by firmware, so can’t just reuse
- Saeed: in mlx5 it is the same object in hw, but abstracted as two different object types in fw
- Configuration APIs not exposed to driver directly, driver always operates at queue level, and can request that all queues have the same config
- Should allow us to provide cleaner APIs like ring size per RSS context without changing all the drivers
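- No concrete interface was shown; the following is a purely hypothetical userspace sketch of an ops table built around the verbs above (all names are illustrative, not the proposed kernel API):

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical per-queue ops mirroring the verbs discussed: memory
 * alloc/free is split from queue start/stop so a queue can be given
 * new memory (e.g. a new ring size) without a full device reset.
 */
struct queue_mem { size_t ring_size; };

struct queue_ops {
	struct queue_mem *(*mem_alloc)(int qid, size_t ring_size);
	void (*mem_free)(struct queue_mem *mem);
	int  (*start)(int qid, struct queue_mem *mem);
	void (*stop)(int qid);
};

/* Core-side helper: swap a queue's memory, rolling back on failure */
static int queue_restart(const struct queue_ops *ops, int qid,
			 struct queue_mem **cur, size_t new_ring_size)
{
	struct queue_mem *fresh = ops->mem_alloc(qid, new_ring_size);
	if (!fresh)
		return -1;

	ops->stop(qid);
	if (ops->start(qid, fresh)) {
		ops->mem_free(fresh);
		ops->start(qid, *cur);	/* best-effort rollback */
		return -1;
	}
	ops->mem_free(*cur);
	*cur = fresh;
	return 0;
}

/* Dummy "driver" so the sketch runs standalone */
static struct queue_mem *dummy_alloc(int qid, size_t sz)
{
	struct queue_mem *m = malloc(sizeof(*m));
	if (m)
		m->ring_size = sz;
	printf("q%d: alloc ring of %zu entries\n", qid, sz);
	return m;
}
static void dummy_free(struct queue_mem *m) { free(m); }
static int dummy_start(int qid, struct queue_mem *m)
{
	printf("q%d: start with %zu entries\n", qid, m->ring_size);
	return 0;
}
static void dummy_stop(int qid) { printf("q%d: stop\n", qid); }

int main(void)
{
	struct queue_ops ops = { dummy_alloc, dummy_free, dummy_start, dummy_stop };
	struct queue_mem *mem = ops.mem_alloc(0, 512);

	ops.start(0, mem);
	queue_restart(&ops, 0, &mem, 1024);	/* resize without "device reset" */
	ops.stop(0);
	ops.mem_free(mem);
	return 0;
}
```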
Willem
- When to refactor complex code?
- skb_segment, ip(6)_append_data
- Florian W. Never a good time, because who knows what others are working on
- Willem. Backporting is also a problem
- Paolo. Never a good time = always a good time
- Discussion of need for testing
- Eric says that syzbot is good and fast
- Willem: is there a policy on accepting kunit code into the kernel
- Jakub: for core code it’s more than welcome, not as sure about driver code
- Jakub: we can also add more DEBUG_NET asserts
- Daniel. Suggests looking into socket lookup (inet_hashtables)
- SO_DEVMEM: direct GPU data placement
- Machines with many accelerators in cards, each with network card
- Can’t pass that volume of data via the CPU’s PCIe root port
- Thus direct interconnect
- Header split is required: headers in host stack, data in GPU memory
- Need to be able to pass non-host memory to the NIC
- Would like to avoid complicating/abusing struct page for device memory
- One approach is to abuse page pool, to return non-page. Needs (much) discussion
- Io_uring: there is a patchset which is similar but different
- struct bio_vec
- Replace struct page * with void *
- If lowest bit is set then it is not a page
- Eric, restricted to 4Gbytes
- Willem, can only use where we use skb_frag_t (which is typedef’d to struct bio_vec)
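- A small userspace sketch of the low-bit tagging idea (the kernel proposal would store this in skb_frag_t/bio_vec; the helpers here are illustrative):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustrative tagging scheme: pointers are at least 2-byte aligned,
 * so bit 0 is free to mark "this is not a struct page".
 */
#define NOT_A_PAGE_BIT	0x1UL

static void *tag_non_page(void *p)
{
	return (void *)((uintptr_t)p | NOT_A_PAGE_BIT);
}

static int is_page(void *p)
{
	return !((uintptr_t)p & NOT_A_PAGE_BIT);
}

static void *untag(void *p)
{
	return (void *)((uintptr_t)p & ~NOT_A_PAGE_BIT);
}

int main(void)
{
	long *devmem = malloc(sizeof(*devmem));	/* stand-in for non-page memory */
	void *frag = tag_non_page(devmem);

	assert(!is_page(frag));
	assert(untag(frag) == devmem);
	printf("tagged frag correctly identified as non-page memory\n");

	free(devmem);
	return 0;
}
```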
- Per queue alloc/free
- API 1: netlink admin commands
- API 2: flow steering inferred from sk 3 or 5 tuple
- API 3: verbs API / devlink subfunctions
- Allow XDP program to receive packets from all queues, not just queue that socket is bound to
- Jesper, current implementation is a (lockless) optimisation.
- Work around is to attach program to all relevant queues
- Willem, use case is using a modern version of packet sockets
- Want to get traffic from kernel path without setting up ntuple
- Jakub, in zero copy mode XDP is worse for consuming only a small number of packets
- Option 1: process on CPU A, kernel on CPU B translating phys <-> af_xdp desc
- Option 2: Immediately reschedule NAPI handler
- Gap between AF_XDP and DPDK is phys/virt desc translation
- Worth dedicating a core to resolve this
Florian W - IPsec Workshop summary
- IPsec child SA: Accelerating single tunnel by spreading load to separate CPUs
- Support for nftables+IPSec offload
- 5x speed up
- First packet goes through the stack, then offload
- Some cargo code present
- Most controversial part is pre-ingress hook to bypass (avoid) tun/tap devices
- Ipsec traffic flow security
- In progress for 5 years
- In RFC since January: https://www.rfc-editor.org/rfc/rfc9347.txt
- Can still observe timing, packet size, … of encrypted traffic
- Solution is to use cell size and pacing
- There is code for linux, but Florian hasn’t seen it
- The old PF_KEY interface is unmaintained, has security holes
- All popular tools use new API, for 10 years or so
- Unfortunately the offload API uses old PF_KEY algorithm names
- (The new API is the xfrm netlink API)
- Predicts deprecation and removal to happen sooner rather than later
Daniel
- Single TCP stream performance testing with zero-copy
- Currently no user-available NIC/driver to make use of TCP zero-copy
- Eric, many NICs support header split
- But there is a performance hit: two DMAs instead of one
- Optimization: NICs pull headers into descriptor instead and only DMA payload portion
- Have patch to implement header split in mlx5 in legacy mode, striding mode more complicated
- Header/data split could be configurable per queue with queue API, and then only a portion of the traffic needed for ZC could be steered there.
- MAX_SKB_FRAGS as Kconfig option with range 17 - 45
- Need for recompile provides friction to adoption
- Current configuration increases memory consumption
- Saeed: MAX_SKB_FRAGS of 45 will break mlx5
- Eric: switch to disallow frag_list generation, but won’t help with MAX_SKB_FRAGS 17
- Potentially best option is to implement frag_list support with ZC
- XDP action: requires driver modification (XDP_NEXT could just be folded into XDP_PASS potentially)
- Toke suggests integrating into libxdp
- Paolo: Wonders if XDP main loop can be factored out
- Toke, or hide behind prog ptr. Already used to add trampoline
- Also, we want this processing loop to be a trampoline at some point
- Could also be used to remove indirect calls
- Drivers need to be converted one by one which will be a lengthy process
- Jakub: is there interest in XDP program per queue?
- Daniel: yes, once we have the queue API as object, we can make this configurable per queue, and have RSS steer to them.
- Jesper, AF_XDP needs this
- Saeed, per-application, thus need current level of control
Eric
- Struct file reorganisation
- Current layout is silly for networking: fields required for FD -> socket lookup spread over 4 cachelines
- Recently regressed due to reorganisation targeted at improving unixbench result
- Need to keep this benchmark in mind with any further changes
- Don’t clone packets in dev_queue_xmit_nit
- Cloning packets to run BPF filters, then dropping 99% of them, makes little sense
- Better to run filter before clone if we expect to queue in af_packet
- Defer wake-ups for TCP/SCTP/MPTCP …
- sk->sk_data_ready and friends are usually called with socket lock held
- Leads to high lock contention, and slow down when unlocking the socket (since another thread touched that cacheline)
- epoll/process scheduler logic can be quite expensive
- Idea is to postpone calls until after lock is released
- store a flag (but not in the socket to avoid dirtying the cacheline) and read it on release?
- Willem: a per-cpu variable like xmit_more?
- Paolo: sometimes sk_data_ready is called from process context
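- A userspace analogue of the deferred-wakeup idea, assuming pthreads: note the pending wakeup while the lock is held and only signal after the unlock, so waiters don’t immediately contend on the just-released lock. This sketches the pattern only, not the kernel change (build with -pthread):

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int queued;

/* Analogue of deferring sk_data_ready: record the pending wakeup
 * in a local flag instead of signalling under the lock.
 */
static void produce(int n)
{
	bool need_wakeup = false;

	pthread_mutex_lock(&lock);
	queued += n;
	need_wakeup = true;		/* where sk_data_ready() would run today */
	pthread_mutex_unlock(&lock);

	if (need_wakeup)		/* wake up only after the lock is released */
		pthread_cond_signal(&cond);
}

static void *consumer(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	while (!queued)
		pthread_cond_wait(&cond, &lock);
	printf("consumed %d units\n", queued);
	queued = 0;
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, consumer, NULL);
	produce(3);
	pthread_join(t, NULL);
	return 0;
}
```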
- Google version of FQ uses 3 prios/bands
- Replicates some of DRR and PRIO, but with fewer hops and better scheduling after watchdog timer completion
- No objections for extending upstream FQ from the room
- UDP listen/accept and 4-tuple lookups
- Special guest Willy Tarreau explains
- HAProxy implements QUIC in userspace
- Single descriptor doesn’t scale well
- Considered using multiple sockets for UDP
- Bind with SO_REUSEPORT and connect to the peer (see the sketch below)
- Works well other than a race between bind and connect: the socket can receive traffic between the 2 calls
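- A minimal sketch of the bind+connect pattern with placeholder addresses; the race is the window between the two calls where the socket can still match traffic from any peer:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SO_REUSEPORT
#define SO_REUSEPORT 15	/* asm-generic value; fallback if libc doesn't expose it */
#endif

int main(void)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0) {
		perror("socket");
		return 1;
	}

	/* Share the listening socket's local address for per-connection sockets */
	int one = 1;
	setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

	struct sockaddr_in local = {
		.sin_family = AF_INET,
		.sin_port = htons(4433),		/* placeholder QUIC port */
	};
	inet_pton(AF_INET, "192.0.2.1", &local.sin_addr);	/* placeholder address */
	if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0)
		perror("bind");

	/* Window between bind() and connect(): the socket can still receive
	 * datagrams from any peer, which is the race discussed above.
	 */
	struct sockaddr_in peer = {
		.sin_family = AF_INET,
		.sin_port = htons(50000),		/* placeholder client port */
	};
	inet_pton(AF_INET, "198.51.100.2", &peer.sin_addr);
	if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0)
		perror("connect");

	close(fd);
	return 0;
}
```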
- With many (10k+) connections there is a problem with UDP hash
- It checks only dst addr+port
- A comment in the implementation notes that most services use few UDP ports; that was written 20 years ago and is no longer the case
- Complication is that hash function is generic
- A quick hack with custom hash function, seems to address performance problem
- Seems that real question is how to better hash UDP connections
- Also looking to move away from connect/bind solution
- Other than the race (mentioned above), most QUIC servers listen on port 443, which is a privileged port, which precludes dropping privileges as might otherwise be possible
- One option is that accept on UDP socket creates new socket which corresponds to pending packet(s)
- Could then accept (rather than recv)
- This would make sense from user-space
- Eric, concurs that this is a good solution. Mentions it has been considered earlier
- At the time SO_REUSEPORT was the solution
- Willem, what has changed in the past 5 years?
- Eric, at that time lookup was only based on destination addr+port
- Now we would need a 3rd hash
- Which would affect unconnected sockets (add a lookup just to check if we have a connected socket, then fall back to an unconnected socket)
- There is a tradeoff between handling connected and unconnected sockets
- Willem, asks why HAProxy moved from unconnected to connected sockets
- Alexei, Meta uses a BPF map to resolve this problem
- Martin KaFai Lau is expert
- HAProxy sees 4x better perf from using connected sockets
- for a proxy, cores are never equally loaded and rebalancing is useful
- Not saying that we shouldn’t change hashing, but rather that problem is well understood (by other people such as Martin)
Iwashima-san - SYN Proxy at Scale with BPF
- Ongoing work to rewrite module using BPF
- Netfilter synproxy is not used in AWS production as it consumes resources
- Different ISN used between client and backend
- Requires fixup
- Rolling salt is part of ISN, shared between all synproxy nodes
- All nodes share the same secret
- Any node can validate the SYN cookie statelessly
- If the backend could validate the ISN, then the fixup wouldn’t be required (see the sketch below)
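- A toy illustration of the stateless-validation property, assuming a shared secret and a coarse rolling counter; the mixing function is a stand-in, not the kernel’s siphash-based SYN cookie:

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Toy mixer standing in for the real (siphash-based) SYN cookie hash */
static uint32_t mix(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
	uint64_t x = (uint64_t)a * 0x9e3779b97f4a7c15ULL;

	x ^= ((uint64_t)b << 32) | c;
	x *= 0xff51afd7ed558ccdULL;
	x ^= d;
	x ^= x >> 33;
	return (uint32_t)x;
}

/* Cookie: hash(4-tuple, shared secret, rolling time counter).
 * Any node that knows the secret can recompute and validate it.
 */
static uint32_t cookie_make(uint32_t saddr, uint32_t daddr, uint32_t ports,
			    uint32_t secret, uint32_t count)
{
	return mix(saddr ^ secret, daddr, ports, count);
}

static int cookie_check(uint32_t cookie, uint32_t saddr, uint32_t daddr,
			uint32_t ports, uint32_t secret, uint32_t count)
{
	/* accept the current or the previous time window */
	return cookie == cookie_make(saddr, daddr, ports, secret, count) ||
	       cookie == cookie_make(saddr, daddr, ports, secret, count - 1);
}

int main(void)
{
	uint32_t secret = 0xdeadbeef;			/* shared across all synproxy nodes */
	uint32_t count  = (uint32_t)(time(NULL) / 64);	/* rolling salt, 64s windows */
	uint32_t saddr = 0x0a000001, daddr = 0x0a000002, ports = (12345u << 16) | 443;

	uint32_t isn = cookie_make(saddr, daddr, ports, secret, count);

	/* A different node, sharing only the secret, validates statelessly */
	printf("valid on node B: %d\n",
	       cookie_check(isn, saddr, daddr, ports, secret, count));
	return 0;
}
```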
- Idea is to add SOCK_OPS BPF hook
- Also discussed handling Timestamp option
- Willem, confirms that the main problem is that there are two per-host secrets that are not shared
- This is similar to RSS
- Asks if sharing secrets would resolve all if not most of the problems
- Iwashima-san, yes, that would resolve the problem
- Willem, proposed solution is far more flexible
- But making secrets available to admin to read (and share) may be far simpler for the problem at hand
- Jesper, Nvidia implemented synproxy for XDP
- Could use this in conjunction with Willem’s proposal
- Willem, could also support rolling salt: machines may come and go, but fleet is always up
- Florian F. Currently the only entropy is time, which is part of the hash
- Dual hash handling is problematic. Need to check twice or sacrifice a bit to indicate generation
- IETF Draft. Proposed in 2016
- Increases option space in non-SYN segment
- EDO option and EDO Extension
- EDO Extension must be included in all segments except for reset
- Implemented for Linux, including various updates to the TCP and MPTCP stack
- Enabled / disabled at runtime (via sysctl?)
- GSO needs to be disabled before connection establishment
- Minor regression when testing localhost (which doesn’t use GRO)
- Jesper, Eric, suggest testing non-local host
- Willem asks how developed proposal is, asks if consideration was given to negotiated scaling options length
- Eric indicates that precluding GRO is a huge problem (for Google); made some disparaging remarks about IETF
Sabrina
- SW simulation of HW offloads
- Good: able to test core changes and error paths; more self tests + fuzzing
- Bad: A lot of effort to maintain and address corner cases
- Ipsec, macsec, tls
- TSO, vlan, …
- Vladimir notes that TC was added to netdevsim, but only control portion
- David A
- Can make sw model, with descriptor
- And either feed into tap (for visibility), or back-to-back
- Qemu model
- Can test real, unmodified, driver
- Checksum and TSO could be implemented
- Sabrina’s concern is that it changes the packet
- Point is to generate edge case packets to feed into fallback path
- Not applicable to TC qdisc offload, either qdisc is in hw or not
- Netdev features expansion
- Only one unused bit, as has been the case for ~2 years (all bits used, except __UNUSED_NETIF_F_1)
- Eric, features make sense on TX path
- But a lot of bits are used on RX, where we don’t care about feature set
- The difference is if a mask is applied or not
- Jakub, TLS bits can also be moved, they are only used in control path
- Only use feature bits for fast path
- Jesper suggests that features could be reordered as an optimisation; asm sometimes loads a single byte
- Eric, split features into two sets: fastpath and other. Combine as needed (see the sketch below)
- Florian F. Some options should not be configurable? F.e. VLAN extraction
- Eric, Willem. Can be useful for testing or edge cases
- Eric. Maybe some are obsolete
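- A hypothetical sketch of such a split: bits tested in the datapath live in one word, control-path-only bits in another, combined only when a full mask is needed (names and bit assignments are illustrative, not the real NETIF_F_* layout):

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative split: the hot word holds only bits tested per packet,
 * the cold word holds control-path-only bits (e.g. TLS setup flags).
 */
#define F_HOT_SG	(1ULL << 0)
#define F_HOT_CSUM	(1ULL << 1)
#define F_COLD_TLS_TX	(1ULL << 0)

struct dev_features {
	uint64_t hot;	/* consulted in the TX fast path */
	uint64_t cold;	/* consulted only when (re)configuring */
};

static int fastpath_can_csum(const struct dev_features *f)
{
	return !!(f->hot & F_HOT_CSUM);	/* single word touched per packet */
}

static uint64_t all_features(const struct dev_features *f)
{
	/* Combine only where the full set is needed, e.g. for reporting.
	 * Assumes hot bits fit in the low 32 bits for this toy packing.
	 */
	return f->hot | (f->cold << 32);
}

int main(void)
{
	struct dev_features f = { .hot = F_HOT_SG | F_HOT_CSUM, .cold = F_COLD_TLS_TX };

	printf("csum in fast path: %d\n", fastpath_can_csum(&f));
	printf("combined mask: 0x%llx\n", (unsigned long long)all_features(&f));
	return 0;
}
```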
- How to increase the number of reviewers
- Maintainers do take into consideration who the reviewer is
- Can use # comment after tag to clarify scope of review
- Paolo clarifies that Reviewed-by is stronger than Acked-by
- But Eric and Toke use these tags differently
- Reviewed-by means fully understand; can fix bugs
- Acked-by means, yeah, that’s a good idea
- Maintainers check for “acked/reviewed if this is fixed”, but reviewers should too
- Reviewers should follow up on updated patchsets
- Jakub notes there is a document on reviewer expectations, includes no need for formal Reviewed-by tag
- Willem suggests that internal review checks may be an option (along the lines of internal review of patches)
- Discussed tooling such as LEI to narrow scope of netdev. Will follow-up in Saeed’s talk
- 🏴 Jakub (and Simon?) to update Netdev process to include information for reviewers, to lower the bar to entry by making expectations clearer
Saeed
- Suggests automatic delegation
- Jakub improvements to patchwork are pending
- This is about engagement, would like vendors more engaged
- Suggests adding maintainer entries
- Would also like to allow subscription to a path rather than the entire ML
- Saeed suggests driver tree
- Jakub is worried it would mean diluting existing review, e.g. from Andrew
- Vendor patch submission + delegation
- mlx5 has 50 - 100 patches per cycle
- One developer per vendor with patches outstanding is painful
- But some high profile developers submit independently
- Jiri feels that bottleneck is mlx5, batches of 15 are a problem
- Jakub, concern is that people submitting actively participate, otherwise we’ll be moving the bulk of the pain onto core maintainers.
- Jakub, pw-bot was down for the past few weeks, which probably explains things
- Nvidia recently started testing net-next
- Stack breaks bi-weekly
- Jakub, will create testing tree
- Outstanding patches in patchwork will automatically be applied
- Nvidia can test that and report before patches are merged
- Wipe and repeat every 3 - 6 hours
- That branch can be tested by CI
- Devlink Device Orchestration
- Need association between function/port/aux and devlink instance
- SF/VF Orchestration
- Create, configure, deploy mode
- Should we support VF creation via devlink?
- Jiri. Issue is that VF is part of PCI spec, which mandates association with PF
- Saeed, proposes dummy PF devlink instance
- Jakub, late population of IRQs may be an issue
- Seems to be more of a PCI than a Networking issue
- Modelling multi function devices
- No model for parent device (PF per port)
- Socket direct (PF per NUMA node, single port)
- Single netdev, multiple devlink instances
Simon - Modeling DPUs / IPUs
- accelerator NIC: some packet processing on the NIC
- offload NIC: all packet processing on the NIC, for VM traffic the host doesn't see the packets at all
- DPU/IPU: control of the datapath is on the NIC
- example use case for DPUs: cloud provider can rent out the entire baremetal machine, but remain in control of the datapath
- there are use cases for DPUs, but it's still a bit unclear how the industry will adopt them
- Jesper suggested exposing sockets
- Willem states most common use case, in his experience, is bare-metal isolation
- Jiri raised that there are two models
- Separate host - Toke described them as just another host in a cluster
- Really smart device
- Florian F drew parallels to WiFi where firmware passes control packets to host
- The host then converts them to a canonical format which is passed to userspace
David Ahern
- Should TCP window consider Buffers available in buffer pool
- Flushing S/W queues when devmem or host memory is invalidated
- H/W Queues for free list in Userspace
- Eric is concerned it will add another path
- Willem, this brings device details into user-space
- Perhaps hw queues with standardised interface (descriptors?)
- Jesper, likes io_uring design better than AF_XDP (which he co-designed)
- David notes that RDMA exposes QPs to user-space
- Toke, RDMA QPs are standardised? Saeed and others, no
- Drop use of extension header
- Removed by driver or needs H/W support
- Eric, notes that previous changes haven’t been propagated to tcpdump/libpcap, so the extension header is still required
- Zero Cost Counters for Userspace Monitoring
- Fast, low overhead, end-to-end metrics to userspace
- Current APIs for Networking Stats
- ethtool -S: vendor specific
- ip -s different set of stats
- Outline of Infrastructure
- Counters are mapped to processes read-only
- Unregister command to remove mapping
- Move existing stats into separate page
- Core code has direct access to stats exported by each layer (see the userspace sketch below)
- Eric, a TCP socket is ~2k, maybe it could be mapped?
- Paolo, thinks there are other structs that are of interest
- Jesper, mentions there are a lot of problems reported in production with cgroup stats
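- A userspace analogue of the mapping idea using a shared page: the writer updates counters in place and the reader sees them without a syscall per read. A sketch of the mechanism only, not the proposed kernel interface:

```c
#include <stdatomic.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Illustrative layout of a stats page; the real proposal would define
 * per-layer sections and export them read-only to the monitoring process.
 */
struct stats_page {
	_Atomic unsigned long rx_packets;
	_Atomic unsigned long tx_packets;
};

int main(void)
{
	/* One shared, page-backed region standing in for the exported stats page */
	struct stats_page *stats = mmap(NULL, sizeof(*stats),
					PROT_READ | PROT_WRITE,
					MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (stats == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	pid_t pid = fork();
	if (pid == 0) {
		/* "producer" (kernel stand-in): bump counters in place */
		for (int i = 0; i < 1000; i++)
			atomic_fetch_add(&stats->rx_packets, 1);
		_exit(0);
	}

	waitpid(pid, NULL, 0);
	/* "consumer": plain loads from the mapping, no syscall per read */
	printf("rx_packets=%lu tx_packets=%lu\n",
	       atomic_load(&stats->rx_packets),
	       atomic_load(&stats->tx_packets));

	munmap(stats, sizeof(*stats));
	return 0;
}
```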
- Driver and Hardware Counters
- Derived Counters
- Typically only interested in a few statistics
- Very flexible approach possible, but has overhead for runtime and code maintenance
- May be limitations on selecting driver stats
- Why eBPF is Not the Answer
- eBPF allows custom counters across the stack
- But has performance limitation (~50%): 4Mpps -> 2Mpps
- Eric, doubts there is a need to run tracepoint for each packet
- Jesper, agrees this will kill performance
- Toke, offsets can be handled by BTF
- Eric, powerpc has 64k pages
- David: This is a 4k page (x86) centric solution
- Willem asks how this scales for workloads with lots of TCP socket churn
- David: this would not be per-socket
- Florian F cautions against a Networking centric solution
- Eric notes that an IPv6/TCP socket already uses more than a page
- David, don’t expect this to be merged within next year. Need to have idea and discussions first. Then move to prototype. ….
- Willem. Ethtool stats doesn’t report per-queue stats in a structured way and the legacy -S output is a mess as the number of per-queue stats grows.
- Alexander, Toke: depends on NIC
Florian W
- Most relate to two-phase commit protocol for configuration
- Should get better over time
- Reducing expressiveness of language to match needs of front-end
- Some problems relating to ipchains (and before)
- Can’t be addressed in the context of network namespaces, or at least it would be a ton of work
- F.e. combinatorial explosion of table jumps on a per-packet basis
- Easy enough to fix for nftables
- But not for iptables, ebtables, arptables, and so on …
- Only practical solution is a jump counter: drop the packet if exceeded
- Anyone with CAP_NET_ADMIN can bring down a box
- and with unprivileged namespaces, that's every user
- Created a private patch to add a sysctl to disable nftables and iptables if unprivileged containers are enabled
- Eric, would also like root w/o CAP_NET_ADMIN in unprivileged namespace
- Tired of CAP_NET_ADMIN bugs
- Not realistic to roll out new kernel each week
- Eric also suggests to reduce depth of virtual devices from 8, to say, 4
- Conntrack assumes that the current skb is the exclusive owner of a newly allocated nf_conn object
- In some situations, such as a bridge with a VLAN device, the skb is cloned and this assumption breaks
- Don’t want to update clone to do copy
- Instead investigating reference counter based approach which facilitates copy when necessary
- Affects bridge netfilter
- Conntrack for cls_bpf may have the same problem
Special thanks to Simon and Sabrina for taking notes throughout the talks!