NetConf 2023
Day 1 - 25th September 2023
Alexei - BPF and Networking Conferences
- No slides
- It is nice to have NetConf back after a 3 year hiatus!
- LPC and Netdev are bigger events, but not coordinated with each other
- Willem:
- LPC, NetConf and Netdev.conf are too close to each other
- 2 conferences a year are fine, for each of BPF and Networking
- But they should be spread out
- Uneasy about merging NetConf and Netdev.conf
- Doesn’t want to lead liaison, but happy for someone else to step up
- Happy to have NetConf colocated with Kernel Recipes
- Netdev typically piggy-backs on a larger conference (restricts timing options)
- LPC has the value of other tracks
- LPC is a large conference. 500 attendees. Always sells out
- Will reach out to Jamal (Netdev)
- Proximity to LPC is a key issue, but there are many factors to consider
- LPC track is really BPF
- Should there be a separate Networking track?
- Netdev.conf is useful regardless, it tries to reach out to universities etc. Has a broader spectrum of attendees.
- Plumbers was supposed to be about discussion rather than lectures
- But perhaps it is too big for discussion
- Very few people are volunteering talks; something is wrong
- Need to tweak the format, but how?
- David M, Jiri, Alexei, Jakub
- Some discussion of merging Netdev into LPC
- Issue may be that existing Netdev organisers would lose control
- Came back to the key issue: proximity of LPC and Netdev
- Some discussion of attaching NetConf to LSF/MM
- Some discussion of back-to-back with Kernel Recipes instead of overlapping
- People want to attend Kernel Recipes
- But there could be extra overhead in organisation
- And 5 consecutive days “wipes people out”
- Conclusion (Daniel/Jakub):
- Having two networking conferences a year would be preferred and venues could be netdev.conf (spring) and LPC (autumn). Needs coordination with Jamal.
- Need to ensure it doesn’t overlap with LSF/MM/BPF, which is also in spring
- For future LPC, it would be good to have BPF and Network track as two separate entities rather than merged.
- Perhaps LPC could have submission type talk vs discussion
- Action: try to propose moving Netdev.conf out to spring / early summer to space the conferences out [David Ahern]
Alexei - Tree Management
- No Slides
- Most patches go through sub trees, get some soaking before hitting net-next
- But some patches go directly into net-next, which can lead to breakage, which affects others
- Proposes that net-next only be an integration tree
- Saeed, proposes new integration branch
- Florian W, proposes an acceptance branch
- Netfilter has done this with ‘testing’ branch
- Bots and so on often catch problems there
- Jakub, key is to have CI, and enforce its use (reject patches if needed). If we don’t have CI, tree shuffling will not help. If we do, we can run it without changing trees/process.
- Paolo, the problem with CI is that we don’t have coverage, and building it will take a long time
- Jakub, if we have coverage we can just run it on linux-next…
- Paolo, it is a big if. Key is to increase coverage
- Jakub: need to ensure that tests are running. Need infrastructure.
- A lot of discussion of the difficulties of making CI available, reliable…
- Alexei: it is a full time job for a team at Meta to run BPF CI
- Saeed, also have a large CI team at Nvidia
- Action: build automation to collect pending patches from the list and put them on a branch so people can run their CIs on it [Jakub]
Toke - XDP: Past, Present and Future
- Has slides
- Summarised features since last NetConf (slide)
- Map lookup, hashmap type, bpf_redirect performance
- Jesper would like to see users before optimising
- Need_wakeup mode
- User-space app running AF_XDP on the same CPU as network processing
- Avoids syscalls as much as possible
- Summarised ongoing work (slide)
- Jesper: Aim to avoid naive implementations that hurt performance
- Toke: turns out that people like containers
- Then they want SR-IOV
- Then they want policy
- Possible to apply policy in the netns of the physical device
- Then run AF_XDP in tenant container
- Performance is ~8Mpps
- AF_XDP for virtio-net
- RFC to redirect frames to crypto engine
- XDP driver support (slide)
- 39 drivers support XDP
- 0 drivers enable all features by default. Mlx5 is probably closest
- Toggle would be a good step
- Which could be adjusted when XDP programs are loaded
- Common defaults too
- Have had some success in folding features into “basic”
- ⚑ Simon will check if nfp has redirect support
- David A asks how many people know about / use monitor
- Alexei proposes a strawman: remove drivers that don’t support XDP. His real point was to deprecate unused drivers.
- Some discussion of why WiFi doesn’t support XDP
- Currently used for high performance data centre use cases
- Would like to see it move to generic tool for fastpath
- Currently proprietary hw offloads are used instead
- A migration path from DPDK is to use DPDK AF_XDP driver
- Key issue with DPDK is deployment model
- Address problems in non-zero copy mode; people care about this
- Willem: some (Intel) drivers require XDP program to be loaded to use AF_XDP, but this doesn’t make sense
- Jesper: can be addressed now that we have features
- The term metadata is overloaded/meaningless
- Saving metadata would be opt-in
- Willem, GRO is a great case for XDP
- It is one of the few packet parsers in the kernel
- But moving it further down the stack seems hard
- Toke, need a place to buffer XDP frames
- Would enable many interesting things
- Queueing xmit side (slide)
- Eric feels that kthread option might make sense
Jesper - Page pool
- Has slides
- Recycling is now everywhere
- Have fragmenting pages. Moves away from one page per packet
- Recent proposal: hide pp_frag_count
- Page pool no longer handles pages?
- Motivation is for Arm64 with 16K pages, one page per packet is wasteful
- Maybe rename to netmem or something else
- Have got good performance by addressing specific cases
- Generalising may partially reverse this
- Suggests adding more memory allocator types
- Discussed returning non-pages
- Currently overloading bit in skb
- Store in frags with page pointer with magic bits set
- The API is growing, makes it difficult to provide alternate implementation / replacement that’s not page-backed
- Asks if out-of-tree benchmarks can be easily changed to use multiple buffers per page
- Jesper: can be easily done
- David: would offer some insights regarding usefulness of such an approach
Alexander - GRO Overhead
- No slides
- For each packet we create an skb, feed it into GRO, then just store the head and pass it to the NAPI cache
- May be better to pass XDP data to GRO; quickly pass frame to GRO layer
- Willem, we already have NAPI GRO by frags
- Eric, but still need to allocate/initialise the skb
- David A, would be a trade off, as you’d need to iterate over a NAPI-budget worth of frames, as they are related to each other
- Jesper, only need a small amount of information from hw to feed into GRO.
- Implementation is a challenge
- Also agrees with Eric that benchmarking is required
- Two options
- XDP aware GRO
- GRO implemented (partially) in XDP
- Jesper, believes GRO is overhead, ~15%
- Alexander. Most drivers pass packet type (TCP, …) in the descriptor. Could pass this into stack
- Quite some discussion about HW GRO, hints, XDP GRO, …
- Willem describes that the state needed to feed hints into sw GRO is too large
- Eric describes scheme used by Google
- HW provides 16bits metadata: start (1bit); end (1bit); flowid (14bits)
- Driver assembles skb based on this information, sw GRO is mostly unused (see the sketch below)
- Jakub, this is really just a GRO-HW implementation if HW does full matching
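- The exact descriptor format wasn’t given; below is a small illustrative sketch of unpacking such a 16-bit hint (the bit positions are assumptions, not the real layout).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical layout of the 16-bit per-descriptor metadata described above:
 * bit 15 = start of flow, bit 14 = end of flow, bits 13..0 = flow id.
 * Bit positions are assumptions, for illustration only.
 */
struct gro_hint {
	uint8_t  start;
	uint8_t  end;
	uint16_t flow_id;
};

static struct gro_hint decode_gro_hint(uint16_t md)
{
	struct gro_hint h = {
		.start   = (md >> 15) & 0x1,
		.end     = (md >> 14) & 0x1,
		.flow_id = md & 0x3fff,
	};
	return h;
}

int main(void)
{
	uint16_t samples[] = { 0x8005, 0x0005, 0x4005 }; /* start, middle, end of flow 5 */

	for (size_t i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		struct gro_hint h = decode_gro_hint(samples[i]);
		printf("md=0x%04x start=%u end=%u flow=%u\n",
		       samples[i], h.start, h.end, h.flow_id);
	}
	return 0;
}
```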
- Io_uring has locking scheme to allow use of user pages by kernel and handling the case where the user space process terminates
- Christoph Hellwig (not present) has complained that page pool should use DMA APIs to avoid starving DMA
- Eric, if a hugepage is used to store many frags, then all frags share a single lock
Jesper - ksoftirq
- Has slides
- “softirq: Let Ksoftirqd do its job” was reverted in May 2023
- Original patch solved UDP overload case
- But other cases became apparent over time; revert was probably the best option
- Cloudflare see less time in softirq with patch, but innocent userspace threads suffer
- An option is to use a different API
- Multi-message UDP receive (recvmmsg) doesn’t help; it just loops over each message internally
- UDP GRO (pending testing)
- Io_uring UDP doesn’t help
- Eric, batching UDP across user-space/kernel will shift the problem but not solve it
- Changelog of revert describes how to resolve problem with modern kernels
- Willem, affects any workload where a significant amount of traffic doesn’t have flow control
- Jakub, in that case busy-poll is a solution (see the sketch below)
- Jesper, has seen this used in conjunction with AF_XDP
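- For reference, a minimal sketch of enabling per-socket busy polling via SO_BUSY_POLL (the fallback #define assumes the common asm-generic value; older kernels may require CAP_NET_ADMIN):

```c
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

#ifndef SO_BUSY_POLL
#define SO_BUSY_POLL 46	/* asm-generic value; fallback if libc doesn't expose it */
#endif

int main(void)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0) {
		perror("socket");
		return 1;
	}

	/* Busy-poll the device queue for up to 50us before sleeping in recv */
	int usecs = 50;
	if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usecs, sizeof(usecs)) < 0)
		perror("setsockopt(SO_BUSY_POLL)");	/* may need CAP_NET_ADMIN on older kernels */

	close(fd);
	return 0;
}
```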
- Eric, the original change was all about latency (it made latency worse).
- Need to measure latency
- Question about using RPS with threaded NAPI. Currently doesn't work, would require the threaded backlog patches proposed by Sebastian.
Paolo - Inclusive Language and more
- Deprecate problematic API
- Add new enums/defines with inclusive names, keeping the same values (see the sketch below)
- Sabrina, Eric have concerns about backporting
- Saeed asks about deprecating old code
- Paolo, initially just add duplicate uAPI
- Later reduce usage of non-inclusive language inside the kernel
- Daniel asks if there is a kernel-wide policy; how other subsystems handle this, maybe by LF TAB?
- Florian F and others. SPI is an example, a rename has occurred there
- Jakub asks if the size of the problem has been quantified
- Paolo. For uAPI there is bonding, bridge and team. But inside kernel usage is more widespread
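- A hypothetical sketch of the additive first step: introduce inclusive aliases with identical values so existing users keep working (names here are illustrative, not the actual bonding/bridge/team uAPI):

```c
#include <stdio.h>

/* Hypothetical uAPI fragment: old names stay for compatibility,
 * new names are just another spelling of the same values.
 */
enum foo_port_role {
	FOO_PORT_ROLE_MASTER	= 0,
	FOO_PORT_ROLE_SLAVE	= 1,
	/* inclusive aliases, same binary values */
	FOO_PORT_ROLE_LEADER	= FOO_PORT_ROLE_MASTER,
	FOO_PORT_ROLE_FOLLOWER	= FOO_PORT_ROLE_SLAVE,
};

int main(void)
{
	/* Old and new spellings are interchangeable on the wire and in code */
	printf("leader==master: %d, follower==slave: %d\n",
	       FOO_PORT_ROLE_LEADER == FOO_PORT_ROLE_MASTER,
	       FOO_PORT_ROLE_FOLLOWER == FOO_PORT_ROLE_SLAVE);
	return 0;
}
```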
- Packet header parsing
- Code is scattered
- Eric, need to prefetch the next packet. Current packet is not enough
- GRO is expensive, because it is the first layer to touch data, which is also touched by other layers
- Paolo, an option is to make monolithic GRO for TCP (and maybe UDP) with fallback to current model
- No data to back this idea at this time
- Jesper, could instrument with perf stat
- Eric, at high packet rates all the cache lines should be hot
- Paolo, another idea is to use a single structure for L3 and L4 callback
- Eric feels it would not make a great deal of difference, given the cache size of modern CPUs
- Plenty of other places to reduce cache line usage
- Syscall using a socket id.
- Struct file is 256 bytes
- Operation requires 4 cache lines
- Long discussion of reordering structures to optimise for cache lines, and avoiding regressions
- Willem raised the issue of dcache vs icache pressure. Adding extra code to avoid a cache-line miss may not be a win
- Jesper suggests using perf to instrument dcache and icache usage
- Current default settings are problematic
- Allocating more memory helps fragmentation, but causes OOM with many netns
- Eric suggests using memory cgroups. But a job with excess fragments will OOM
- Feels there is no good default
- Checksum in addition to the TCP checksum (at the MPTCP datablock level)
- Option is currently disabled by default
- Would like hw offload support
- Saeed feels that TX is doable, but not RX
- Is MPTCP blocking some TCP-specific optimisation?
- Willem comments that changes to, f.e., TCP timestamps also needed MPTCP updates
- Not obvious that TCP and MPTCP are in different directories
- And testing is unlikely
- Issue is that there are two implementations of the same interface
- Paolo, MPTCP has issues such as sub-flows to deal with. This has led to similar implementations of (mostly) the same interfaces
- Willem, need to figure out how to get more testing, and centralised location for tests
- Paolo, Have tried to implement test coverage, including packet drill
- General question: which subsystems should use PRs?
- Jakub indicates no preference: whatever is easier for subsystems
- Patches need to be posted to netdev at least once
- Motivation is that MPTCP maintainer(s) (not present) would find this easier
- Several members of the audience were dubious about that, but there were no objections
- How to evaluate?
- Alexei advises that it can be done with some manual work
- Florian points to existing tooling (lcov?)
Jiri - Devlink
- Used for orchestration, configuration; multiple devices
- Idea was to create entity whose scope was different from netdevs; suitable for configuring device
- Multiple types of devlink params: runtime, permanent, …
- Permanent is intended to (partially) replace vendor tools that update device
- Vendor-specific side channels persist
- Would like to open discussion on extending API to allow configuration of objects inside device
- Device has many objects: eswitch, VF, SF, …
- Jakub asks for more concrete requirements
- Some lively discussion about vendor vs vendor specific interfaces
- Jakub, raises issue of rate limiting for RSS context
- Saeed suggests TC+HTB
- RSS contexts do not have representors and the main netdev has FQ not HTB configured
- Simon raises the issue of modelling device
- F.e. queues for VFs. Asks if this is for devlink
- Jiri feels it is not
- Then briefly discussed modelling devices in general
- This will be the topic of Simon’s presentation tomorrow
- Jiri asks how to receive a subset of iproute monitor messages
- Suggests BPF
- David A, BPF on netlink doesn’t work well; it is too complicated; perhaps because message has to be formed before it can be filtered, which is too late, may as well drop in user-space
- Florian W suggests kfunc hooks
- Toke suggests subscription mechanism
- Discussion was inconclusive
- Jakub asks about two PFs on single device, as per recent patches from Corigine
- Currently separate devlink objects
- Possibility is to denote that one devlink instance is a peer of another
- Jiri will look into this
- Simon agrees that it should solve the problem
Florian F - mDNS wakeup and offload
- Have slides
- Set top boxes connected via Ethernet or WiFi
- Need to achieve Network standby due to EU CoC or Energy Star
- Wake-up on mDNS supported by some Broadcom drivers
- Hardware does not support matching only specific mDNS service; fallback to low-power CPU
- No upstream API to define strings to match on; typically hardcoded in firmware
- mDNS offload
- Querying for casting device is not enough
- Device will query when joining network even if not streaming video
- Streaming is ultimate identifier of intent; match on SYN/ACK
- Need to have db of mDNS records in fw
- No API for this (Android is expected to have one but it’s not public)
- Can extend ethtool_rx_flow_spec?
- Expect < 5 matches; at worst < 10
- There was no objection to expanding ethtool with configuring an explicit mDNS offload DB
Vladimir Oltean
- IEEE 802.1CB - Frame Replication and Elimination for Reliability
- Purpose is to ensure there is redundancy in the network; zero failover time
- Packets are sent on multiple paths, as close to the sender as possible
- Alternatives are HSR/PRP, which are for use on rings
- Redundancy tag is an L2 header
- Paolo suggests that it can be modelled using a networking device
- Issue is determining the stream that packets belong to, would need to replicate large portions of the tc classifiers
- Idea from Vladimir is one sw netdev per stream
- Another idea from Jiri is to put the devices in the block, use the indev for the classifier to differentiate, and forward to a dummy (or veth) device that a socket can be bound to
- Jakub suggest that shared actions could be used
- User applications could open AF_PACKET sockets and dynamically create tc filters and actions, which may also enable the hanic use case (see the sketch below)
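- A minimal sketch of the userspace side of that idea: an AF_PACKET socket bound to one interface (the tc filter/action setup would happen separately; the interface name is a placeholder):

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <net/if.h>
#include <arpa/inet.h>

int main(void)
{
	/* Requires CAP_NET_RAW; "eth0" is a placeholder interface name */
	int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
	if (fd < 0) {
		perror("socket(AF_PACKET)");
		return 1;
	}

	struct sockaddr_ll sll = {
		.sll_family   = AF_PACKET,
		.sll_protocol = htons(ETH_P_ALL),
		.sll_ifindex  = if_nametoindex("eth0"),
	};
	if (!sll.sll_ifindex || bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
		perror("bind");
		close(fd);
		return 1;
	}

	char buf[2048];
	ssize_t n = recv(fd, buf, sizeof(buf), 0);	/* one frame from the bound device */
	printf("received %zd bytes\n", n);

	close(fd);
	return 0;
}
```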
- Ethernet over backplane links
- 802.3 Clause 73 auto-negotiation used on backplanes and SFP28 modules
- Requires special auto-negotiation and link training; on NXP hw this is assisted by software; one RFC sent using phylib and one using phylink_pcs.
- Asks which phy-mode is used to describe MAC link to backplane internal PHY. With phylink_pcs, the AN/LT block is not a PHY, so perhaps “internal” is not fine.
- Is there a way to detect the media type of a link? Need to know whether to advertise the backplane modes (KR) or the SFP28 technology ability modes (CR) in the C73 base page
- Florian F asks how widespread it is
- Jakub answers that there is currently no way
- Jiri suggests device tree
- What does phylink do with SFP28 modules? Does it expect C73 autoneg?
- Florian F suggests trying some different SFP28 modules to find out if they expect C73
Day 2 - 26th September 2023
Jakub
- Stats, pw-bot, check stats, …
- Some discussion of pw CI
- No correlation between pass/warn/fail and acceptance of patches
- Need to work on noise in results. Patches welcome!
- There is a list of people who are not to be contacted. Patches welcome!
- Will work on ageing: f.e. no contact if a file has been untouched by an individual for 3 years [Jakub]
- Can’t test every patch, there are too many
- Automatically select patches from patchwork, create testing branch, build image with latest tools
- Spawn VM for non-hw tests
- For vendor tests, make images available for vendors to download. And provide mechanism for vendors to upload results
- Florian F suggests looking at kernel-ci, which aggregates results from many sources
- Jakub says the UI wasn’t great; and site seemed to often be down
- No relevant Networking vendor seems to be working with kernel ci at this time
- No vendors present were confident about being able to do so
- Goals: simplify config for drivers, create queues without device reset
- {Memory, Queue}{alloc, free}, Queue {start, stop, restart}
- David A: IB verbs equivalent is QP: handled by firmware, so can’t just reuse
- Saeed: in mlx5 it is the same object in hw, but abstracted as two different object types in fw
- Configuration APIs not exposed to driver directly, driver always operates at queue level, and can request that all queues have the same config
- Should allow us to provide cleaner APIs like ring size per RSS context without changing all the drivers
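- No concrete interface was shown; the following is a purely hypothetical userspace sketch of an ops table built around the verbs above (all names are illustrative, not the proposed kernel API):

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical per-queue ops mirroring the verbs discussed: memory
 * alloc/free is split from queue start/stop so a queue can be given
 * new memory (e.g. a new ring size) without a full device reset.
 */
struct queue_mem { size_t ring_size; };

struct queue_ops {
	struct queue_mem *(*mem_alloc)(int qid, size_t ring_size);
	void (*mem_free)(struct queue_mem *mem);
	int  (*start)(int qid, struct queue_mem *mem);
	void (*stop)(int qid);
};

/* Core-side helper: swap a queue's memory, rolling back on failure */
static int queue_restart(const struct queue_ops *ops, int qid,
			 struct queue_mem **cur, size_t new_ring_size)
{
	struct queue_mem *fresh = ops->mem_alloc(qid, new_ring_size);
	if (!fresh)
		return -1;

	ops->stop(qid);
	if (ops->start(qid, fresh)) {
		ops->mem_free(fresh);
		ops->start(qid, *cur);	/* best-effort rollback */
		return -1;
	}
	ops->mem_free(*cur);
	*cur = fresh;
	return 0;
}

/* Dummy "driver" so the sketch runs standalone */
static struct queue_mem *dummy_alloc(int qid, size_t sz)
{
	struct queue_mem *m = malloc(sizeof(*m));
	if (m)
		m->ring_size = sz;
	printf("q%d: alloc ring of %zu entries\n", qid, sz);
	return m;
}
static void dummy_free(struct queue_mem *m) { free(m); }
static int dummy_start(int qid, struct queue_mem *m)
{
	printf("q%d: start with %zu entries\n", qid, m->ring_size);
	return 0;
}
static void dummy_stop(int qid) { printf("q%d: stop\n", qid); }

int main(void)
{
	struct queue_ops ops = { dummy_alloc, dummy_free, dummy_start, dummy_stop };
	struct queue_mem *mem = ops.mem_alloc(0, 512);

	ops.start(0, mem);
	queue_restart(&ops, 0, &mem, 1024);	/* resize without "device reset" */
	ops.stop(0);
	ops.mem_free(mem);
	return 0;
}
```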
Willem
- When to refactor complex code?
- skb_segment, ip(6)_append_data
- Florian W. Never a good time, because who knows what others are working on
- Willem. Backporting is also a problem
- Paolo. Never a good time = always a good time
- Discussion of need for testing
- Eric says that syzbot is good and fast
- Willem: is there a policy on accepting kunit code into the kernel
- Jakub: for core code it’s more than welcome, not as sure about driver code
- Jakub: we can also add more DEBUG_NET asserts
- Daniel. Suggests looking into socket lookup (inet_hashtables)
- SO_DEVMEM: direct GPU data placement
- Machines with many accelerators in cards, each with network card
- Can’t pass that volume of data via the CPU’s PCIe root port
- Thus direct interconnect
- Header split is required: headers in host stack, data in GPU memory
- Need to be able to pass non-host memory to the NIC
- Would like to avoid complicating/abusing struct page for device memory
- One approach is to abuse page pool, to return non-page. Needs (much) discussion
- Io_uring: there is a patchset which is similar but different
- struct bio_vec
- Replace struct page * with void *
- If lowest bit is set then it is not a page
- Eric, restricted to 4Gbytes
- Willem, can only use where we use skb_frag_t (which is typedef’d to struct bio_vec)
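- A small userspace sketch of the low-bit tagging idea (the kernel proposal would store this in skb_frag_t/bio_vec; the helpers here are illustrative):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustrative tagging scheme: pointers are at least 2-byte aligned,
 * so bit 0 is free to mark "this is not a struct page".
 */
#define NOT_A_PAGE_BIT	0x1UL

static void *tag_non_page(void *p)
{
	return (void *)((uintptr_t)p | NOT_A_PAGE_BIT);
}

static int is_page(void *p)
{
	return !((uintptr_t)p & NOT_A_PAGE_BIT);
}

static void *untag(void *p)
{
	return (void *)((uintptr_t)p & ~NOT_A_PAGE_BIT);
}

int main(void)
{
	long *devmem = malloc(sizeof(*devmem));	/* stand-in for non-page memory */
	void *frag = tag_non_page(devmem);

	assert(!is_page(frag));
	assert(untag(frag) == devmem);
	printf("tagged frag correctly identified as non-page memory\n");

	free(devmem);
	return 0;
}
```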
- Per queue alloc/free
- API 1: netlink admin commands
- API 2: flow steering inferred from sk 3 or 5 tuple
- API 3: verbs API / devlink subfunctions
- Allow XDP program to receive packets from all queues, not just queue that socket is bound to
- Jesper, current implementation is a (lockless) optimisation.
- Work around is to attach program to all relevant queues
- Willem, use case is using a modern version of packet sockets
- Want to get traffic from kernel path without setting up ntuple
- Jakub, in zero copy mode XDP is worse for consuming only a small number of packets
- Option 1: process on CPU A, kernel on CPU B translating phys <-> af_xdp desc
- Option 2: Immediately reschedule NAPI handler
- Gap between AF_XDP and DPDK is phys/virt desc translation
- Worth dedicating a core to resolve this
Florian W - IPsec Workshop summary
- IPsec child SA: Accelerating single tunnel by spreading load to separate CPUs
- Support for nftables+IPSec offload
- 5x speed up
- First packet goes through the stack, then offload
- Some cargo code present
- Most controversial part is pre-ingress hook to bypass (avoid) tun/tap devices
- Ipsec traffic flow security
- In progress for 5 years
- In RFC since January: https://www.rfc-editor.org/rfc/rfc9347.txt
- Can still observe timing, packet size, … of encrypted traffic
- Solution is to use cell size and pacing
- There is code for linux, but Florian hasn’t seen it
- The old PF_KEY interface is unmaintained, has security holes
- All popular tools use new API, for 10 years or so
- Unfortunately the offload API uses old PF_KEY algorithm names
- (The new API is the xfrm netlink API)
- Predicts deprecation and removal to happen sooner rather than later
Daniel
- Single TCP stream performance testing with zero-copy
- Currently no user-available NIC/driver to make use of TCP zero-copy
- Eric, many NICs support header split
- But there is a performance hit: two DMAs instead of one
- Optimization: NICs pull headers into descriptor instead and only DMA payload portion
- Have patch to implement header split in mlx5 in legacy mode, striding mode more complicated
- Header/data split could be configurable per queue with queue API, and then only a portion of the traffic needed for ZC could be steered there.
- MAX_SKB_FRAGS as Kconfig option with range 17 - 45
- Need for recompile provides friction to adoption
- Current configuration increases memory consumption
- Saeed: MAX_SKB_FRAGS of 45 will break mlx5
- Eric: switch to disallow frag_list generation, but won’t help with MAX_SKB_FRAGS 17
- Potentially best option is to implement frag_list support with ZC
- XDP action: requires driver modification (XDP_NEXT could just be folded into XDP_PASS potentially)
- Toke suggests integrating into libxdp
- Paolo: Wonders if XDP main loop can be factored out
- Toke, or hide behind prog ptr. Already used to add trampoline
- Also, we want this processing loop to be a trampoline at some point
- Could also be used to remove indirect calls
- Drivers need to be converted one by one which will be a lengthy process
- Jakub: is there interest in XDP program per queue?
- Daniel: yes, once we have the queue API as object, we can make this configurable per queue, and have RSS steer to them.
- Jesper, AF_XDP needs this
- Saeed, per-application, thus need current level of control
Eric
- Struct file reorganisation
- Current layout is silly for networking: fields required for FD -> socket lookup spread over 4 cachelines
- Recently regressed due to reorganisation targeted at improving unixbench result
- Need to keep this benchmark in mind with any further changes
- Don’t clone packets in dev_queue_xmit_nit
- Cloning packets to run BPF filters, then dropping 99% of them, makes little sense
- Better to run filter before clone if we expect to queue in af_packet
- Defer wake-ups for TCP/SCTP/MPTCP …
- sk->sk_data_ready and friends are usually called with socket lock held
- Leads to high lock contention, and slow down when unlocking the socket (since another thread touched that cacheline)
- epoll/process scheduler logic can be quite expensive
- Idea is to postpone calls until after lock is released
- store a flag (but not in the socket to avoid dirtying the cacheline) and read it on release?
- Willem: a per-cpu variable like xmit_more?
- Paolo: sometimes sk_data_ready is called from process context
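- A userspace analogue of the deferred-wakeup idea, assuming pthreads: note the pending wakeup while the lock is held and only signal after the unlock, so waiters don’t immediately contend on the just-released lock. This sketches the pattern only, not the kernel change (build with -pthread):

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int queued;

/* Analogue of deferring sk_data_ready: record the pending wakeup
 * in a local flag instead of signalling under the lock.
 */
static void produce(int n)
{
	bool need_wakeup = false;

	pthread_mutex_lock(&lock);
	queued += n;
	need_wakeup = true;		/* where sk_data_ready() would run today */
	pthread_mutex_unlock(&lock);

	if (need_wakeup)		/* wake up only after the lock is released */
		pthread_cond_signal(&cond);
}

static void *consumer(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	while (!queued)
		pthread_cond_wait(&cond, &lock);
	printf("consumed %d units\n", queued);
	queued = 0;
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, consumer, NULL);
	produce(3);
	pthread_join(t, NULL);
	return 0;
}
```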
- Google version of FQ uses 3 prios/bands
- Replicates some of DRR and PRIO, but with fewer hops and better scheduling after watchdog timer completion
- No objections for extending upstream FQ from the room
- UDP listen/accept and 4-tuple lookups
- Special guest Willy Tarreau explains
- HAProxy implements QUIC in userspace
- Single descriptor doesn’t scale well
- Considered using multiple sockets for UDP
- Bind with SO_REUSEPORT and connect to the peer (see the sketch below)
- Works well other than a race between bind and connect: the socket can receive traffic between the 2 calls
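- A minimal sketch of the bind+connect pattern with placeholder addresses; the race is the window between the two calls where the socket can still match traffic from any peer:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SO_REUSEPORT
#define SO_REUSEPORT 15	/* asm-generic value; fallback if libc doesn't expose it */
#endif

int main(void)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0) {
		perror("socket");
		return 1;
	}

	/* Share the listening socket's local address for per-connection sockets */
	int one = 1;
	setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

	struct sockaddr_in local = {
		.sin_family = AF_INET,
		.sin_port = htons(4433),		/* placeholder QUIC port */
	};
	inet_pton(AF_INET, "192.0.2.1", &local.sin_addr);	/* placeholder address */
	if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0)
		perror("bind");

	/* Window between bind() and connect(): the socket can still receive
	 * datagrams from any peer, which is the race discussed above.
	 */
	struct sockaddr_in peer = {
		.sin_family = AF_INET,
		.sin_port = htons(50000),		/* placeholder client port */
	};
	inet_pton(AF_INET, "198.51.100.2", &peer.sin_addr);
	if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0)
		perror("connect");

	close(fd);
	return 0;
}
```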
- With many (10k+) connections there is a problem with UDP hash
- It checks only dst addr+port
- A comment in the implementation notes that most services use few UDP ports; that was written 20 years ago and is no longer the case
- Complication is that hash function is generic
- A quick hack with custom hash function, seems to address performance problem
- Seems that real question is how to better hash UDP connections
- Also looking to move away from connect/bind solution
- Other than the race (mentioned above), most QUIC servers listen on port 443, which is a privileged port, which precludes dropping privileges as might otherwise be possible
- One option is that accept on UDP socket creates new socket which corresponds to pending packet(s)
- Could then accept (rather than recv)
- This would make sense from user-space
- Eric, concurs that this is a good solution. Mentions it has been considered earlier
- At the time SO_REUSEPORT was the solution
- Willem, what has changed in the past 5 years?
- Eric, at that time lookup was only based on destination addr+port
- Now we would need a 3rd hash
- Which would affect unconnected sockets (add a lookup just to check if we have a connected socket, then fall back to an unconnected socket)
- There is a tradeoff between handling connected and unconnected sockets
- Willem, asks why HAProxy moved from unconnected to connected sockets
- Alexei, Meta uses a BPF map to resolve this problem
- Martin KaFai Lau is expert
- HAProxy sees 4x better perf from using connected sockets
- for a proxy, cores are never equally loaded and rebalancing is useful
- Not saying that we shouldn’t change hashing, but rather that problem is well understood (by other people such as Martin)
Iwashima-san - SYN Proxy at Scale with BPF
- Ongoing work to rewrite module using BPF
- Netfilter synproxy is not used in AWS production as it consumes resources
- Different ISN used between client and backend
- Requires fixup
- Rolling salt is part of ISN, shared between all synproxy nodes
- All nodes share the same secret
- Any node can validate the SYN cookie statelessly
- If the backend could validate the ISN, then the fixup wouldn’t be required (see the sketch below)
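- A toy illustration of the stateless-validation property, assuming a shared secret and a coarse rolling counter; the mixing function is a stand-in, not the kernel’s siphash-based SYN cookie:

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Toy mixer standing in for the real (siphash-based) SYN cookie hash */
static uint32_t mix(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
	uint64_t x = (uint64_t)a * 0x9e3779b97f4a7c15ULL;

	x ^= ((uint64_t)b << 32) | c;
	x *= 0xff51afd7ed558ccdULL;
	x ^= d;
	x ^= x >> 33;
	return (uint32_t)x;
}

/* Cookie: hash(4-tuple, shared secret, rolling time counter).
 * Any node that knows the secret can recompute and validate it.
 */
static uint32_t cookie_make(uint32_t saddr, uint32_t daddr, uint32_t ports,
			    uint32_t secret, uint32_t count)
{
	return mix(saddr ^ secret, daddr, ports, count);
}

static int cookie_check(uint32_t cookie, uint32_t saddr, uint32_t daddr,
			uint32_t ports, uint32_t secret, uint32_t count)
{
	/* accept the current or the previous time window */
	return cookie == cookie_make(saddr, daddr, ports, secret, count) ||
	       cookie == cookie_make(saddr, daddr, ports, secret, count - 1);
}

int main(void)
{
	uint32_t secret = 0xdeadbeef;			/* shared across all synproxy nodes */
	uint32_t count  = (uint32_t)(time(NULL) / 64);	/* rolling salt, 64s windows */
	uint32_t saddr = 0x0a000001, daddr = 0x0a000002, ports = (12345u << 16) | 443;

	uint32_t isn = cookie_make(saddr, daddr, ports, secret, count);

	/* A different node, sharing only the secret, validates statelessly */
	printf("valid on node B: %d\n",
	       cookie_check(isn, saddr, daddr, ports, secret, count));
	return 0;
}
```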
- Idea is to add SOCK_OPS BPF hook
- Also discussed handling Timestamp option
- Willem, confirms that the main problem is that there are two per-host secrets that are not shared
- This is similar to RSS
- Asks if sharing secrets would resolve all if not most of the problems
- Iwashima-san, yes, that would resolve the problem
- Willem, proposed solution is far more flexible
- But making secrets available to admin to read (and share) may be far simpler for the problem at hand
- Jesper, Nvidia implemented synproxy for XDP
- Could use this in conjunction with Willem’s proposal
- Willem, could also support rolling salt: machines may come and go, but fleet is always up
- Florian F. Currently the only entropy is time, which is part of the hash
- Dual hash handling is problematic. Need to check twice or sacrifice a bit to indicate generation
- IETF Draft. Proposed in 2016
- Increases option space in non-SYN segment
- EDO option and EDO Extension
- EDO Extension must be included in all segments except for reset
- Implemented for Linux, including various updates to the TCP and MPTCP stack
- Enabled / disabled at runtime (via sysctl?)
- GSO needs to be disabled before connection establishment
- Minor regression when testing localhost (which doesn’t use GRO)
- Jesper, Eric, suggest testing non-local host
- Willem asks how developed proposal is, asks if consideration was given to negotiated scaling options length
- Eric indicates that precluding GRO is a huge problem (for Google); made some disparaging remarks about IETF
Sabrina
- SW simulation of HW offloads
- Good: able to test core changes and error paths; more self tests + fuzzing
- Bad: A lot of effort to maintain and address corner cases
- Ipsec, macsec, tls
- TSO, vlan, …
- Vladimir notes that TC was added to netdevsim, but only control portion
- David A
- Can make sw model, with descriptor
- And either feed into tap (for visibility), or back-to-back
- Qemu model
- Can test real, unmodified, driver
- Checksum and TSO could be implemented
- Sabrina’s concern is that it changes the packet
- Point is to generate edge case packets to feed into fallback path
- Not applicable to TC qdisc offload, either qdisc is in hw or not
- Netdev features expansion
- Only one unused bit, as has been the case for ~2 years (all bits used, except __UNUSED_NETIF_F_1)
- Eric, features make sense on TX path
- But a lot of bits are used on RX, where we don’t care about feature set
- The difference is if a mask is applied or not
- Jakub, TLS bits can also be moved, they are only used in control path
- Only use feature bits for fast path
- Jesper suggests that features could be reordered as an optimisation; asm sometimes loads a single byte
- Eric, split features into two sets: fastpath and other. Combine as needed (see the sketch below)
- Florian F. Some options should not be configurable? F.e. VLAN extraction
- Eric, Willem. Can be useful for testing or edge cases
- Eric. Maybe some are obsolete
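- A hypothetical sketch of such a split: bits tested in the datapath live in one word, control-path-only bits in another, combined only when a full mask is needed (names and bit assignments are illustrative, not the real NETIF_F_* layout):

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative split: the hot word holds only bits tested per packet,
 * the cold word holds control-path-only bits (e.g. TLS setup flags).
 */
#define F_HOT_SG	(1ULL << 0)
#define F_HOT_CSUM	(1ULL << 1)
#define F_COLD_TLS_TX	(1ULL << 0)

struct dev_features {
	uint64_t hot;	/* consulted in the TX fast path */
	uint64_t cold;	/* consulted only when (re)configuring */
};

static int fastpath_can_csum(const struct dev_features *f)
{
	return !!(f->hot & F_HOT_CSUM);	/* single word touched per packet */
}

static uint64_t all_features(const struct dev_features *f)
{
	/* Combine only where the full set is needed, e.g. for reporting.
	 * Assumes hot bits fit in the low 32 bits for this toy packing.
	 */
	return f->hot | (f->cold << 32);
}

int main(void)
{
	struct dev_features f = { .hot = F_HOT_SG | F_HOT_CSUM, .cold = F_COLD_TLS_TX };

	printf("csum in fast path: %d\n", fastpath_can_csum(&f));
	printf("combined mask: 0x%llx\n", (unsigned long long)all_features(&f));
	return 0;
}
```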
- How to increase the number of reviewers
- Maintainers do take into consideration who the reviewer is
- Can use # comment after tag to clarify scope of review
- Paolo clarifies that Reviewed-by is stronger than Acked-by
- But Eric and Toke use these tags differently
- Reviewed-by means fully understand; can fix bugs
- Acked-by means, yeah, that’s a good idea
- Maintainers check for “acked/reviewed if this is fixed”, but reviewers should too
- Reviewers should follow up on updated patchsets
- Jakub notes there is a document on reviewer expectations, includes no need for formal Reviewed-by tag
- Willem suggests that internal review checks may be an option (along the lines of internal review of patches)
- Discussed tooling such as LEI to narrow scope of netdev. Will follow-up in Saeed’s talk
- 🏴 Jakub (and Simon?) to update Netdev process to include information for reviewers, to lower the bar to entry by making expectations clearer
Saeed
- Suggests automatic delegation
- Jakub improvements to patchwork are pending
- This is about engagement, would like vendors more engaged
- Suggests adding maintainer entries
- Would also like to allow subscription to a path rather than the entire ML
- Saeed suggests driver tree
- Jakub is worried it would mean diluting existing review, e.g. from Andrew
- Vendor patch submission + delegation
- mlx5 has 50 - 100 patches per cycle
- One developer per vendor with patches outstanding is painful
- But some high profile developers submit independently
- Jiri feels that bottleneck is mlx5, batches of 15 are a problem
- Jakub, concern is that people submitting actively participate, otherwise we’ll be moving the bulk of the pain onto core maintainers.
- Jakub, pw-bot was down for the past few weeks, which probably explains things
- Nvidia recently started testing net-next
- Stack breaks bi-weekly
- Jakub, will create testing tree
- Outstanding patches in patchwork will automatically be applied
- Nvidia can test that and report before patches are merged
- Wipe and repeat every 3 - 6 hours
- That branch can be tested by CI
- Devlink Device Orchestration
- Need association between function/port/aux and devlink instance
- SF/VF Orchestration
- Create, configure, deploy mode
- Should we support VF creation via devlink?
- Jiri. Issue is that VF is part of PCI spec, which mandates association with PF
- Saeed, proposes dummy PF devlink instance
- Jakub, late population of IRQs may be an issue
- Seems to be more of a PCI than a Networking issue
- Modelling multi function devices
- No model for parent device (PF per port)
- Socket direct (PF per NUMA node, single port)
- Single netdev, multiple devlink instances
Simon - Modeling DPUs / IPUs
- accelerator NIC: some packet processing on the NIC
- offload NIC: all packet processing on the NIC, for VM traffic the host doesn't see the packets at all
- DPU/IPU: control of the datapath is on the NIC
- example use case for DPUs: cloud provider can rent out the entire baremetal machine, but remain in control of the datapath
- there are use cases for DPUs, but it's still a bit unclear how the industry will adopt them
- Jesper suggested exposing sockets
- Willem states most common use case, in his experience, is bare-metal isolation
- Jiri raised that there are two models
- Separate host - Toke described them as just another host in a cluster
- Really smart device
- Florian F drew parallels to WiFi where firmware passes control packets to host
- The host then converts them to a canonical format which is passed to userspace
David Ahern
- Should TCP window consider Buffers available in buffer pool
- Flushing S/W queues when devmem or host memory is invalidated
- H/W Queues for free list in Userspace
- Eric is concerned it will add another path
- Willem, this brings device details into user-space
- Perhaps hw queues with standardised interface (descriptors?)
- Jesper, likes io_uring design better than AF_XDP (which he co-designed)
- David notes that RDMA exposes QPs to user-space
- Toke, RDMA QPs are standardised? Saeed and others, no
- Drop use of extension header
- Removed by driver or needs H/W support
- Eric, notes that previous changes haven’t been propagated to tcpdump/libpcap, so the extension header is still required
- Zero Cost Counters for Userspace Monitoring
- Fast, low overhead, end-to-end metrics to userspace
- Current APIs for Networking Stats
- ethtool -S: vendor specific
- ip -s different set of stats
- Outline of Infrastructure
- Counters are mapped to processes read-only
- Unregister command to remove mapping
- Move existing stats into separate page
- Core code has direct access to stats exported by each layer (see the userspace sketch below)
- Eric, a TCP socket is ~2k, maybe it could be mapped?
- Paolo, thinks there are other structs that are of interest
- Jesper, mentions there are a lot of problems reported in production with cgroup stats
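- A userspace analogue of the mapping idea using a shared page: the writer updates counters in place and the reader sees them without a syscall per read. A sketch of the mechanism only, not the proposed kernel interface:

```c
#include <stdatomic.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Illustrative layout of a stats page; the real proposal would define
 * per-layer sections and export them read-only to the monitoring process.
 */
struct stats_page {
	_Atomic unsigned long rx_packets;
	_Atomic unsigned long tx_packets;
};

int main(void)
{
	/* One shared, page-backed region standing in for the exported stats page */
	struct stats_page *stats = mmap(NULL, sizeof(*stats),
					PROT_READ | PROT_WRITE,
					MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (stats == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	pid_t pid = fork();
	if (pid == 0) {
		/* "producer" (kernel stand-in): bump counters in place */
		for (int i = 0; i < 1000; i++)
			atomic_fetch_add(&stats->rx_packets, 1);
		_exit(0);
	}

	waitpid(pid, NULL, 0);
	/* "consumer": plain loads from the mapping, no syscall per read */
	printf("rx_packets=%lu tx_packets=%lu\n",
	       atomic_load(&stats->rx_packets),
	       atomic_load(&stats->tx_packets));

	munmap(stats, sizeof(*stats));
	return 0;
}
```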
- Driver and Hardware Counters
- Derived Counters
- Typically only interested in a few statistics
- Very flexible approach possible, but has overhead for runtime and code maintenance
- May be limitations on selecting driver stats
- Why eBPF is Not the Answer
- eBPF allows custom counters across the stack
- But has performance limitation (~50%): 4Mpps -> 2Mpps
- Eric, doubts there is a need to run tracepoint for each packet
- Jesper, agrees this will kill performance
- Toke, offsets can be handled by BTF
- Eric, powerpc has 64k pages
- David: This is a 4k page (x86) centric solution
- Willem asks how this scales for workloads with lots of TCP socket churn
- David: this would not be per-socket
- Florian F cautions against a Networking centric solution
- Eric notes that an IPv6/TCP socket already uses more than a page
- David, don’t expect this to be merged within next year. Need to have idea and discussions first. Then move to prototype. ….
- Willem. Ethtool stats doesn’t report per-queue stats in a structured way and the legacy -S output is a mess as the number of per-queue stats grows.
- Alexander, Toke: depends on NIC
Florian W
- Most relate to two-phase commit protocol for configuration
- Should get better over time
- Reducing expressiveness of language to match needs of front-end
- Some problems relating to ipchains (and before)
- Can’t be addressed in the context of network namespaces, or at least it would be a ton of work
- F.e. combinatorial explosion of table jumps on a per-packet basis
- Easy enough to fix for nftables
- But not for iptables, ebtables, arptables, and so on …
- Only practical solution is a jump counter: drop the packet if exceeded
- Anyone with CAP_NET_ADMIN can bring down a box
- and with unprivileged namespaces, that's every user
- Created a private patch to add a sysctl to disable nftables and iptables if unprivileged containers are enabled
- Eric, would also like root w/o CAP_NET_ADMIN in unprivileged namespace
- Tired of CAP_NET_ADMIN bugs
- Not realistic to roll out new kernel each week
- Eric also suggests to reduce depth of virtual devices from 8, to say, 4
- Conntrack assumes that the current skb is the exclusive owner of a newly allocated nf_conn object
- In some situations, such as a bridge with a VLAN device, the skb is cloned and this assumption breaks
- Don’t want to update clone to do copy
- Instead investigating reference counter based approach which facilitates copy when necessary
- Affects bridge netfilter
- Conntrack for cls_bpf may have the same problem
Special thanks to Simon and Sabrina for taking notes throughout the talks!