Data centers increasingly deploy commodity servers with high-speed network interfaces to enable low-latency communication. However, achieving low latency at high data rates crucially depends on how the incoming traffic interacts with the system’s caches. When packets that need to be processed in the same way are consecutive, i.e., exhibit high temporal and spatial locality, caches deliver great benefits.
In this paper, we systematically study the impact of temporal and spatial traffic locality on the performance of commodity servers equipped with high-speed network interfaces. Our results show that (i) the performance of a variety of widely deployed applications degrades substantially with even the slightest lack of traffic locality, and (ii) a traffic trace from our organization reveals poor traffic locality, as networking protocols, drivers, and the underlying switching/routing fabric spread packets out in time (reducing locality).
To address these issues, we built Reframer, a software solution that deliberately delays packets and reorders them to increase traffic locality. Despite introducing μs-scale delays of some packets, we show that Reframer increases the throughput of a network service chain by up to 84% and reduces the flow completion time of a web server by 11% while improving its throughput by 20%.
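The reordering idea can be illustrated with a minimal sketch. It is not Reframer's implementation: it assumes packets are buffered for a short window and represented as hypothetical `(flow_key, sequence_number)` tuples, and it simply groups the packets of that window by flow so that packets processed the same way become consecutive.

```python
from collections import OrderedDict
from typing import Hashable, Iterable, List, Tuple

# Hypothetical packet representation: (flow key, sequence number).
Packet = Tuple[Hashable, int]


def reorder_by_flow(window: Iterable[Packet]) -> List[Packet]:
    """Group the packets of one buffering window by flow.

    Flows are emitted in order of first arrival, and packets within a
    flow keep their original relative order, so no flow is reordered
    internally.
    """
    groups: "OrderedDict[Hashable, List[Packet]]" = OrderedDict()
    for pkt in window:
        groups.setdefault(pkt[0], []).append(pkt)
    return [pkt for flow_pkts in groups.values() for pkt in flow_pkts]


# Interleaved arrivals from three flows within one window.
interleaved = [("A", 1), ("B", 1), ("A", 2), ("C", 1), ("B", 2), ("A", 3)]
reordered = reorder_by_flow(interleaved)
```

After the call, all packets of flow A come first, then those of B, then C. The price is that early packets wait until the window closes, which corresponds to the μs-scale delay mentioned above.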
After we encountered novel challenges arising at 100G speeds, a longer follow-up version of our MiddleClick paper was published in IEEE/ACM Transactions on Networking in 2021, adding hardware offloading and an improved algorithm for combining sessions.
The code has been merged into FastClick, providing a single state-management system for multiple VNFs, which are combined automatically. On top of this session system, one can easily modify TCP or HTTP streams on the fly without full termination!
Check out the paper!
In this journal version, we extended our conference paper with additional, peer-reviewed material:
We implemented our system on QUIC using P4 and Picoquic. This demonstrates that our approach does not depend solely on TCP timestamps. The code in ‘bmv2’ and ‘p4-tofino’ has been made publicly available. All of our code is available at https://github.com/cheetahlb/.
We added an experiment using the Tofino implementation and the QUIC implementation of Cheetah for an HTTP webserver.
We added an experiment to verify whether today’s OSes support TCP timestamp, have them enabled by default, and correctly echo the TCP timestamp set by a server.
We added an experiment to verify the granularity of the TCP timestamp units used by some of the largest Alexa top 100 websites.
We added a proof sketch on the size of the cookies given a number of servers.
We added an implementation in bmv2 of the “TCP timestamp”-based system. We have also rewritten and published the P4-tofino code of the system. The implementation of the stateful LB is non-trivial, as it requires insertion, lookup, and deletion operations to be applied in constant time (and more restrictions apply). We describe our implementation of a stack-based data structure for the Tofino in Section 4.3.
We added a micro-benchmark of the performance of the Cheetah LB, e.g., compared SYN insertions with cuckoo, normal packets,
We broke down the benefits of using SSE instructions to parse TCP options.
We evaluated the packet processing latency overheads of realizing Cheetah on a Tofino for both the TCP timestamp and QUIC implementation.
We clarified the design challenges in the introduction.
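To give an intuition for the constant-time requirement on the stateful LB mentioned in the list above, here is a minimal sketch of a connection table backed by a stack of free indices. It is only an illustration in Python, not the Tofino implementation from the paper; the class name `ConnTable`, its methods, and the idea that the returned index is what a packet carries back are assumptions made for this example.

```python
from typing import List, Optional


class ConnTable:
    """Fixed-size connection table with O(1) insert, lookup, and delete.

    A stack of free slot indices replaces any search for an empty slot.
    """

    def __init__(self, capacity: int) -> None:
        self.servers: List[Optional[str]] = [None] * capacity
        # Stack of free indices; reversed so that pop() hands out 0 first.
        self.free: List[int] = list(range(capacity - 1, -1, -1))

    def insert(self, server: str) -> int:
        """Store the server chosen for a new connection; return its slot."""
        if not self.free:
            raise OverflowError("connection table is full")
        idx = self.free.pop()
        self.servers[idx] = server
        return idx

    def lookup(self, idx: int) -> Optional[str]:
        """Return the server stored in a slot (None if the slot is free)."""
        return self.servers[idx]

    def delete(self, idx: int) -> None:
        """Release a slot and push its index back onto the free stack."""
        if self.servers[idx] is None:
            raise KeyError(idx)
        self.servers[idx] = None
        self.free.append(idx)


table = ConnTable(capacity=2)
a = table.insert("server-1")
b = table.insert("server-2")
table.delete(a)
c = table.insert("server-3")  # reuses the slot that was just released
```

Every operation touches one array slot and the top of the stack, so the cost does not depend on how many connections are tracked.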
Georgios P. Katsikas, Tom Barbette, Dejan Kostić, Gerald Q. Maguire Jr., Rebecca Steinert
The NSDI version of Metron supported the integration of blackbox network functions (NFs) using ring buffers. This choice limited Metron’s applicability, as real networks might contain hardware blackboxes (also known as middleboxes) or closed-source blackbox binaries running inside virtual machines (VMs) or containers. In this extended journal version, published in ACM Transactions on Computer Systems, we put special effort into integrating these important blackbox types into Metron, while maintaining Metron’s hardware-level performance.
This integration was not trivial as it involved tedious low-level system aspects related to (i) efficiently dispatching packets without introducing unnecessary inter-core communication and (ii) techniques to allow high-speed service chaining. These were key principles of Metron that we wanted to maintain. Moreover, we incorporated the latest functionalities of modern 100 GbE NICs, such as single root I/O virtualization (SR-IOV) that enables physical to virtual NIC dispatching, avoiding the need for software switching. Metron instructs the physical NIC to tag the packets according to the core associated with a traffic class by the controller. The tag can then be used to dispatch packets to queues just as a Metron agent does.
The original Metron system, as it appeared at USENIX NSDI 2018, demonstrated an experiment on dynamic scaling at 10 Gbps. Since 100 GbE deployments are becoming the new commodity, we put substantial effort into refining Metron’s scaling algorithm. Part of this algorithm uses our new method for deriving the load of a CPU core even when this core performs NIC polling (e.g., using DPDK poll mode drivers).
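Why polling makes load hard to measure: a polling core spins at 100% utilization from the operating system's point of view, whether or not packets arrive. The sketch below shows one generic way to estimate load in that situation, by counting only the cycles of poll iterations that returned packets. This is an assumption for illustration and not necessarily the method used in Metron; the `PollLoadEstimator` class and its cycle counts are hypothetical.

```python
class PollLoadEstimator:
    """Estimate the useful load of a core that busy-polls a NIC queue."""

    def __init__(self) -> None:
        self.busy_cycles = 0
        self.total_cycles = 0

    def record(self, cycles: int, packets: int) -> None:
        """Account for one poll iteration.

        `cycles` is the time spent in the iteration; it counts as useful
        work only if the poll returned at least one packet.
        """
        self.total_cycles += cycles
        if packets > 0:
            self.busy_cycles += cycles

    def load(self) -> float:
        """Fraction of cycles spent in iterations that processed packets."""
        if self.total_cycles == 0:
            return 0.0
        return self.busy_cycles / self.total_cycles


estimator = PollLoadEstimator()
estimator.record(cycles=100, packets=32)  # useful iteration
estimator.record(cycles=100, packets=0)   # empty poll
estimator.record(cycles=200, packets=0)   # empty poll
estimator.record(cycles=100, packets=4)   # useful iteration
core_load = estimator.load()              # 200 useful cycles out of 500
```

A scaling algorithm could compare such an estimate against thresholds to decide when to add or remove cores, instead of relying on OS-reported CPU utilization.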
The 100 GbE testbed used in the NSDI version of Metron exhibited hardware limitations that prevented Metron from reaching line-rate performance. In this journal version, we repeated the same experiment on two additional testbeds. First, we upgraded the 100 GbE NICs of the original testbed (replacing the Mellanox ConnectX-4 NICs with newer Mellanox ConnectX-5 NICs) and managed to increase the maximum throughput to 85 Gbps (76 Gbps was the previous limit). Then, we also upgraded the servers of the testbed to new workstations with Intel’s Skylake hardware architecture (the old servers used Intel’s Haswell hardware architecture) and managed to achieve line-rate 100 Gbps packet processing.
The paper also presents a dozen other novelties compared to the NSDI version, so check it out!
Our paper “High-speed Connection Tracking in Modern Servers” will be presented by Massimo Girondi at the IEEE HPSR 2021, the 22nd International Conference on High-Performance Switching and Routing.
We analyzed the performance of six different hash table implementations, studying how to scale them across multiple cores and how to efficiently remove expired entries, and benchmarked them with up to 100 Gbps of traffic.
This is joint work with Marco Chiesa and Massimo Girondi, the first author.
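As a rough illustration of the expired-entry problem studied in the connection tracking paper, the sketch below shows a flow table that refreshes a timestamp on every lookup and removes stale entries in a separate sweep. It is a generic technique built on a Python dict, not one of the six hash tables evaluated in the paper; the `FlowTable` class, its methods, and the timeout value are hypothetical.

```python
from typing import Dict, Hashable, Tuple


class FlowTable:
    """Connection table whose entries expire after `timeout` seconds idle."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        # flow key -> (state, time the flow was last seen)
        self.entries: Dict[Hashable, Tuple[str, float]] = {}

    def lookup_or_insert(self, key: Hashable, now: float, new_state: str) -> str:
        """Return the live state of a flow, or install `new_state`.

        An entry that is present but already expired is treated as
        missing and overwritten (lazy expiry on access).
        """
        entry = self.entries.get(key)
        if entry is not None and now - entry[1] <= self.timeout:
            state = entry[0]
        else:
            state = new_state
        self.entries[key] = (state, now)
        return state

    def sweep(self, now: float) -> int:
        """Remove all expired entries; return how many were removed."""
        expired = [k for k, (_, seen) in self.entries.items()
                   if now - seen > self.timeout]
        for k in expired:
            del self.entries[k]
        return len(expired)


flows = FlowTable(timeout=10.0)
first = flows.lookup_or_insert("f1", now=0.0, new_state="s1")
again = flows.lookup_or_insert("f1", now=5.0, new_state="s2")  # still live
flows.lookup_or_insert("f2", now=6.0, new_state="s9")
removed = flows.sweep(now=15.5)  # f1 idle for 10.5 s, f2 for 9.5 s
reborn = flows.lookup_or_insert("f1", now=16.0, new_state="s3")
```

The full-table sweep here is linear in the number of entries, which is exactly the kind of cost that becomes a concern at high packet rates and across multiple cores.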
Yesterday we presented our recent work at PAM 2021, in which we study the classification performance of NVIDIA Mellanox NICs and uncover some performance bottlenecks. Watch out when relying on hardware offloads: they can be very nice, but they come with drawbacks when you try to push too many rules or too many tables.
We’re pleased to announce that our upcoming paper “What you need to know about (Smart) Network Interface Cards” by G. Katsikas, T. Barbette, M. Chiesa, D. Kostic and G. Maguire Jr. has been accepted at PAM’21!
We can’t publish the preprint yet, but we’ll do so as soon as possible 🙂
Network interface cards (NICs) are fundamental components of modern high-speed networked systems, supporting multi-100 Gbps speeds and increasing programmability. Offloading computation from a server’s CPU to a NIC frees a substantial amount of the server’s CPU resources, making NICs key to offering competitive cloud services. Therefore, understanding the performance benefits and limitations of offloading a networking application to a NIC is of paramount importance. In this paper, we measure the performance of four different NICs from one of the largest NIC vendors worldwide, supporting 100 Gbps and 200 Gbps. We show that while today’s NICs can easily support multi-hundred-gigabit throughputs, performing frequent update operations of a NIC’s packet classifier (as network address translators (NATs) and load balancers would do for each incoming connection) results in a dramatic throughput reduction of up to 70 Gbps or complete denial of service. Our conclusion is that none of the tested NICs can support high-speed networking applications that require keeping track of a large number of frequently arriving incoming connections. Furthermore, we show a variety of counter-intuitive performance artefacts, including the performance impact of using multiple tables to classify flows of packets.
In early February, my colleague Alireza Farshin and I gave a talk at FOSDEM, a huge open-source gathering. The video is now released!
In the talk we present FastClick with a short demo, review the existing alternative modular frameworks (mainly VPP and BESS), and then discuss the future of software dataplanes, which we believe our recent work PacketMill starts to address.
We mainly show how FastClick is still fully up to date with the competition and goes beyond the state of the art with PacketMill’s enhancements. We also redid an experiment at 100G showing that FastClick now improves on Click by more than 30x in a forwarding configuration. This is because we have maintained FastClick for nearly 6 years now, considering pull requests and integrating recent research, while good old Click itself has sadly been stalled for a decade. I will write a blog post about the state of FastClick in the coming weeks.
I also bought the www.fastclick.dev domain to start a little showcase website. For now it redirects to GitHub. Feel free to help 🙂