Multi-core network packet steering
Network packet steering of transmitted and received traffic for multi-core architectures is needed in modern network computing environments, especially in data centers, where high bandwidth and heavy loads can easily congest a single core's queue.[1]

For this reason, many techniques, both in hardware and in software, are leveraged to distribute the incoming packet load across the cores of the processor.
On the traffic-receiving side, the most notable techniques presented in this article are RSS, aRFS, RPS and RFS.
For transmission, we will focus on XPS.[2][3]
Packets coming into the network interface card (NIC) are processed and loaded into the receive queues managed by the cores (which are usually implemented as ring buffers within kernel space).
The main objective is to leverage all the cores available in the CPU to process incoming packets, while also improving performance metrics such as latency and throughput.[4][5][6]
Hardware techniques
Hardware-accelerated techniques like RSS and aRFS are used to route and load-balance incoming packets across the multiple cores' queues of a processor.[1]
These hardware-supported methods achieve very low latencies and reduce the load on the CPU compared to the software-based ones. However, they require specialized hardware integrated within the network interface controller, which is usually available only on more advanced cards, such as SmartNICs.[7][8]
RSS
Receive Side Scaling (RSS) is a hardware-supported technique that leverages an indirection table indexed by the low-order bits of the result of a hash function, which takes the packet's header fields as input.
The hash function input is usually customizable, and the header fields used can vary between use cases and implementations.
Notable examples of header fields chosen as hash keys are the layer-3 source and destination IP addresses, the protocol, and the layer-4 source and destination ports.
In this way, packets belonging to the same flow are directed to the same receive queue, preserving their original order and avoiding out-of-order delivery. Moreover, thanks to the properties of the hash function, incoming flows are load-balanced across all the available cores.[5]
Another important feature introduced by the indirection table is the ability to change the mapping of flows to cores without changing the hash function, by simply updating the table entries.[9][10][4][11]
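Below is a minimal sketch of how an RSS-style indirection table selects a receive queue. The hash shown (CRC32 over the flow tuple) is only a stand-in for the Toeplitz hash typically implemented by NICs, and the table size and field choices are illustrative assumptions.

```python
# Illustrative sketch of RSS-style queue selection (not real NIC firmware).
# Real hardware usually computes a Toeplitz hash with a secret key;
# CRC32 is used here only as a stand-in hash function.
import socket
import struct
import zlib

NUM_QUEUES = 8                                              # receive queues / cores
INDIRECTION_TABLE = [i % NUM_QUEUES for i in range(128)]    # hash bits -> queue

def flow_hash(src_ip, dst_ip, src_port, dst_port, proto):
    """Hash the flow tuple, mimicking a hash over packet header fields."""
    key = socket.inet_aton(src_ip) + socket.inet_aton(dst_ip) + \
          struct.pack("!HHB", src_port, dst_port, proto)
    return zlib.crc32(key)

def select_queue(src_ip, dst_ip, src_port, dst_port, proto=6):
    """Index the indirection table with the low-order bits of the hash."""
    h = flow_hash(src_ip, dst_ip, src_port, dst_port, proto)
    return INDIRECTION_TABLE[h & (len(INDIRECTION_TABLE) - 1)]

# Packets of the same flow always land on the same queue:
print(select_queue("10.0.0.1", "10.0.0.2", 40000, 443))
print(select_queue("10.0.0.1", "10.0.0.2", 40000, 443))     # same queue as above
```

Remapping a flow to a different core only requires rewriting the corresponding INDIRECTION_TABLE entry; the hash function itself stays untouched.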
aRFS
Accelerated Receive Flow Steering (aRFS) is another hardware-supported technique, born from the idea of leveraging cache locality to improve performance by routing incoming packet flows to specific cores.
Unlike RSS, which is a fully independent hardware implementation, aRFS needs to interface with the software (the kernel) to function properly.[12]
RSS simply load-balances incoming traffic across the cores; however, if a packet flow is directed to core i (as a result of the hash function) while the application consuming the received packets is running on core j, many cache misses can be avoided by simply forcing i = j, so that packets are received exactly where they are needed and consumed.[8]
To do this, aRFS does not forward packets directly based on the result of the hash function; instead, a configurable steering table (which can be filled and updated, for instance, by the scheduler through an API) is used to steer packet flows to the specific consuming core.[8][7]
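The idea can be sketched as follows, assuming a hypothetical steering table populated by software: when a flow has an entry, it overrides the hash-based choice; otherwise the packet falls back to plain hash-based distribution. The table structure and function names are illustrative, not a real driver API.

```python
# Illustrative sketch of aRFS-style steering (not a real driver interface).
# A steering table, filled by software, overrides the default hash-based
# choice so that a flow is delivered to the core that consumes it.
NUM_CORES = 8
STEERING_TABLE = {}          # flow tuple -> core id (hypothetical structure)

def register_consumer(flow, core_id):
    """Record that the application consuming this flow runs on core_id."""
    STEERING_TABLE[flow] = core_id

def steer(flow):
    """Return the target core: steering-table entry if present, else hash."""
    if flow in STEERING_TABLE:
        return STEERING_TABLE[flow]          # deliver where the app runs
    return hash(flow) % NUM_CORES            # fall back to RSS-like hashing

flow = ("10.0.0.1", "10.0.0.2", 40000, 443, 6)
register_consumer(flow, core_id=3)           # consumer runs on core 3
print(steer(flow))                           # -> 3
```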
Software techniques
Software techniques like RPS and RFS employ one of the CPU cores to steer incoming packets across the other cores of the processor. This comes at the cost of introducing additional inter-processor interrupts (IPIs); however, the number of hardware interrupts does not increase and, by employing an interrupt aggregation technique, it could even be reduced.[13]
The benefit of a software solution is its ease of implementation: no component of the current architecture (like the NIC) has to be changed, only the proper kernel module deployed. This can be crucial especially in cases where the server machine cannot be customized or accessed (as in cloud computing environments), even though network performance may be lower compared to the hardware-supported techniques.[14][15][16]
RPS
Receive Packet Steering (RPS) is the software parallel of RSS. All packets received by the NIC are load-balanced between the cores' queues by a hash function computed over configurable header fields (like the layer-3 source and destination IP addresses and the layer-4 source and destination ports), in the same fashion as RSS.
Moreover, thanks to the hash properties, packets belonging to the same flow will always be steered to the same core.[15]
This is usually done in the kernel, just after the NIC driver: once the hardware interrupt has been handled and before further protocol processing, the packet is placed on the receive queue of the selected core, which is then notified through an inter-processor interrupt.[14]
RPS can be used in conjunction with RSS when the number of queues managed by the hardware is lower than the number of cores. In this case, after incoming packets have been distributed across the RSS queues, a pool of cores can be assigned to each queue, and RPS spreads the flows of each queue across its assigned pool.[13]
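On Linux, RPS is typically configured through sysfs by writing a hexadecimal CPU bitmask for each receive queue. A minimal sketch is shown below; the device name, queue number and mask are illustrative, and writing these files requires root privileges.

```python
# Sketch: enable RPS on a Linux receive queue by writing a CPU bitmask
# to sysfs (requires root; device name "eth0" and mask are illustrative).
from pathlib import Path

def set_rps_cpus(dev, rx_queue, cpu_mask):
    """Allow the CPUs in the hex bitmask to process packets of this queue."""
    path = Path(f"/sys/class/net/{dev}/queues/rx-{rx_queue}/rps_cpus")
    path.write_text(f"{cpu_mask:x}\n")

# Spread packets of queue rx-0 over CPUs 0-3 (bitmask 0b1111 = 0xf).
set_rps_cpus("eth0", 0, 0b1111)
```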
RFS
Receive Flow Steering (RFS) extends RPS in the same direction as the aRFS hardware solution does.
By routing packet flows to the same CPU core running the consuming application, cache locality can be improved and leveraged, avoiding many misses and reducing the latencies introduced by retrieving the data from main memory.[17]
To do this, after the hash of the header fields has been computed for the current packet, the result is used to index a lookup table.
This table is managed by the scheduler, which updates its entries when application processes are moved between cores.[18]
The overall CPU load distribution remains balanced as long as the user-space applications are evenly distributed across the cores.[16][18]
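On Linux, enabling RFS amounts to sizing the global flow table and the per-queue flow counts, as sketched below; the values and device name are illustrative assumptions and root privileges are required.

```python
# Sketch: enable RFS on Linux by sizing the global socket-flow table and
# the per-queue flow counts (requires root; device and values illustrative).
from pathlib import Path

def enable_rfs(dev, num_rx_queues, flow_entries=32768):
    # Global table recording on which CPU each flow's consumer last ran.
    Path("/proc/sys/net/core/rps_sock_flow_entries").write_text(f"{flow_entries}\n")
    # Split the entries evenly across the device's receive queues.
    per_queue = flow_entries // num_rx_queues
    for q in range(num_rx_queues):
        Path(f"/sys/class/net/{dev}/queues/rx-{q}/rps_flow_cnt").write_text(f"{per_queue}\n")

enable_rfs("eth0", num_rx_queues=4)
```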
XPS (in transmission)
Transmit Packet Steering (XPS) operates on the transmission side, as opposed to the techniques mentioned so far. When packets need to be loaded onto one of the transmit queues exposed by the NIC, there are again many possible optimizations that can be applied.[19]
For instance, if multiple transmit queues are assigned to a single core, a hash function can be used to load-balance outgoing packets across the queues (similarly to what RPS does in reception).
Moreover, in order to improve cache locality and hit rate (similarly to what RFS does), XPS ensures that applications producing the outgoing traffic and running on core i will favor the transmit queues associated with the same core i. This reduces the overhead of inter-core communication and cache-coherency protocols, resulting in better performance under heavy load.[20][21][2]
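On Linux, XPS is configured analogously to RPS, by writing a CPU bitmask per transmit queue. The sketch below assumes a one-to-one mapping between cores and transmit queues, with an illustrative device name, and requires root privileges.

```python
# Sketch: map each transmit queue to the CPU that produces its traffic by
# writing a CPU bitmask to sysfs (requires root; device name illustrative).
from pathlib import Path

def set_xps_cpus(dev, tx_queue, cpu_mask):
    """Prefer this tx queue for packets generated on the CPUs in the mask."""
    path = Path(f"/sys/class/net/{dev}/queues/tx-{tx_queue}/xps_cpus")
    path.write_text(f"{cpu_mask:x}\n")

# One-to-one mapping: queue tx-i is favored by traffic produced on CPU i.
for cpu in range(4):
    set_xps_cpus("eth0", tx_queue=cpu, cpu_mask=1 << cpu)
```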
References
[edit]- ^ a b Barbette, Tom; Katsikas, Georgios P.; Maguire, Gerald Q.; Kostić, Dejan (2019-12-03). "RSS++: load and state-aware receive side scaling". Proceedings of the 15th International Conference on Emerging Networking Experiments And Technologies. CoNEXT '19. New York, NY, USA: Association for Computing Machinery. doi:10.1145/3359989.3365412. ISBN 978-1-4503-6998-5.
- ^ a b Madden, Michael M. (2019-01-06), "Challenges Using the Linux Network Stack for Real-Time Communication", AIAA Scitech 2019 Forum, AIAA SciTech Forum, American Institute of Aeronautics and Astronautics, pp. 9–11, doi:10.2514/6.2019-0503, retrieved 2025-07-10
- ^ Herbert, Tom (2025-02-24). "The alphabet soup of receive packet steering: RSS, RPS, RFS, and aRFS". Medium. Retrieved 2025-07-10.
- ^ a b "RSS kernel linux docs". kernel.org. The Linux Kernel documentation. Retrieved 2025-07-08.
- ^ a b "RSS overview by microsoft". learn.microsoft.com. Retrieved 2025-07-08.
- ^ Wu, Wenji; DeMar, Phil; Crawford, Matt (2011-02-01). "Why Can Some Advanced Ethernet NICs Cause Packet Reordering?". IEEE Communications Letters. doi:10.1109/LCOMM.2011.122010.102022. ISSN 1558-2558.
- ^ a b "aRFS by redhat". docs.redhat.com. Red Hat Documentation. Retrieved 2025-07-08.
- ^ a b c "aRFS by nvidea". docs.nvidia.com. NVIDIA Documentation Hub. Retrieved 2025-07-08.
- ^ "RSS intel doc" (PDF). earn.microsoft.com. Retrieved 2025-07-08.
- ^ "RSS by redhat". docs.redhat.com. Red Hat Documentation. Retrieved 2025-07-08.
- ^ "Receive-side scaling enhancements in windows server 2008". microsoft.com. Microsoft. Retrieved 2025-07-08.
- ^ "aRFS kernel linux docs". kernel.org. The Linux Kernel documentation. Retrieved 2025-07-08.
- ^ a b "RPS kernel linux docs". kernel.org. The Linux Kernel documentation. Retrieved 2025-07-08.
- ^ a b "RPS linux news (LWM)". lwn.net. Linux Weekly News. Retrieved 2025-07-08.
- ^ a b "RPS by redhat". docs.redhat.com. Red Hat Documentation. Retrieved 2025-07-08.
- ^ a b "RFS by nvidea". docs.nvidia.com. NVIDIA Documentation Hub. Retrieved 2025-07-08.
- ^ "RFS by redhat". docs.redhat.com. Red Hat Documentation. Retrieved 2025-07-08.
- ^ a b "RFS kernel linux docs". kernel.org. The Linux Kernel documentation. Retrieved 2025-07-08.
- ^ "XPS linux news (LWM)". lwn.net. Linux Weekly News. Retrieved 2025-07-08.
- ^ "XPS intel overview". intel.com. Intel corp. Retrieved 2025-07-08.
- ^ "XPS kernel linux docs". kernel.org. The Linux Kernel documentation. Retrieved 2025-07-08.
Further reading
- Enberg, Pekka; Rao, Ashwin; Tarkoma, Sasu (2019-12-09). "Partition-Aware Packet Steering Using XDP and eBPF for Improving Application-Level Parallelism". Proceedings of the 1st ACM CoNEXT Workshop on Emerging in-Network Computing Paradigms. ENCP '19. New York, NY, USA: Association for Computing Machinery. doi:10.1145/3359993.3366766. ISBN 978-1-4503-7000-4.
- Helbig, Maike; Kim, Younghoon (2025-01-01). "IAPS: Decreasing Software-Based Packet Steering Overhead Through Interrupt Reduction". 2025 International Conference on Information Networking (ICOIN). doi:10.1109/ICOIN63865.2025.10993154.
- Kumar, Ashwin; Katkam, Rajneesh; Chaudhary, Pranav; Naik, Priyanka; Vutukuru, Mythili (2024-05-01). "AppSteer: Framework for Improving Multicore Scalability of Network Functions via Application-aware Packet Steering". 2024 IEEE 24th International Symposium on Cluster, Cloud and Internet Computing (CCGrid). doi:10.1109/CCGrid59990.2024.00012.
- Tyunyayev, Nikita; Delzotti, Clément; Eran, Haggai; Barbette, Tom (2025-06-09). "ASNI: Redefining the Interface Between SmartNICs and Applications". Proc. ACM Netw. (CoNEXT2). doi:10.1145/3730966.