Our latest paper “Cheetah”, a load balancer that guarantees per-connection-consistency

Cheetah is a new load balancer that solves the challenge of remembering which connection was sent to which server, without the traditional trade-off between uniform load balancing and efficiency. Cheetah is up to 5 times faster than stateful load balancers and can support advanced balancing mechanisms that reduce flow completion time by a factor of 2 to 3, without breaking connections even while adding and removing servers.

More information at https://www.usenix.org/conference/nsdi20/presentation/barbette.

Dynamic DNS with OVH

It may not be widely known, but OVH lets you run your own Dynamic DNS if you rent a domain name, which is surely a better option than the paid service from dyndns.org. I will explain how to handle the updates on Linux using ddclient.

On the manager

Connect to https://www.ovh.com/manager/web/#/configuration/domain/, select your domain name, and create a new DynHost entry with the button on the right.

Enter a sub-domain name such as “mydns” (.tombarbette.be), and add your current IP, or just 8.8.8.8 as a placeholder for the time being.

You are not done yet: you have to create a login that will be allowed to update that DNS entry. Select the second button to manage accesses and create a new login.

Choose a login (probably the name of the subdomain), the subdomain itself, and a password.

On the server

sudo apt install ddclient

Then edit /etc/ddclient.conf:

protocol=dyndns2
use=web,web=checkip.dyndns.com
server=www.ovh.com
login=tombarbette.be-mydns
password='password'
mydns.tombarbette.be

Just run “sudo ddclient” to update once, then “sudo service ddclient restart” to have it updated automatically from then on.

I hope this is helpful to someone; personally I forget it all the time, so I wanted to leave a post-it somewhere.

Our new paper RSS++: load and state-aware receive side scaling

I’m delighted to announce the publication of our latest paper titled “RSS++: load and state-aware receive side scaling” at CoNEXT’19.

Abstract

While the current literature typically focuses on load-balancing among multiple servers, in this paper, we demonstrate the importance of load-balancing within a single machine (potentially with hundreds of CPU cores). In this context, we propose a new load-balancing technique (RSS++) that dynamically modifies the receive side scaling (RSS) indirection table to spread the load across the CPU cores in a more optimal way. RSS++ incurs up to 14x lower 95th percentile tail latency and orders of magnitude fewer packet drops compared to RSS under high CPU utilization. RSS++ allows higher CPU utilization and dynamic scaling of the number of allocated CPU cores to accommodate the input load while avoiding the typical 25% over-provisioning.

RSS++ has been implemented for both (i) DPDK and (ii) the Linux kernel. Additionally, we implement a new state migration technique which facilitates sharding and reduces contention between CPU cores accessing per-flow data. RSS++ keeps the flow-state by groups that can be migrated at once, leading to a 20% higher efficiency than a state of the art shared flow table.

Paper; Video; Slides

Our latest paper “Metron: NFV Service Chains at the True Speed of the Underlying Hardware” has been accepted at NSDI 2018!

Abstract

In this paper we present Metron, a Network Functions Virtualization (NFV) platform that achieves high resource utilization by jointly exploiting the underlying network and commodity servers’ resources. This synergy allows Metron to: (i) offload part of the packet processing logic to the network, (ii) use smart tagging to setup and exploit the affinity of traffic classes, and (iii) use tag-based hardware dispatching to carry out the remaining packet processing at the speed of the servers’ fastest cache(s), with zero intercore communication. Metron also introduces a novel resource allocation scheme that minimizes the resource allocation overhead for large-scale NFV deployments. With commodity hardware assistance, Metron deeply inspects traffic at 40 Gbps and realizes stateful network functions at the speed of a 100 GbE network card on a single server. Metron has 2.75-6.5x better efficiency than OpenBox, a state of the art NFV system, while ensuring key requirements such as elasticity, fine-grained load balancing, and flexible traffic steering.

Paper, slides & video

Do HUAWEI CloudEngine switches support OpenFlow?

No, no and no.

Despite what the ONF says (https://www.opennetworking.org/product-registry/), they do not. Huawei’s OpenFlow implementation is actually broken: the very first OpenFlow HELLO message is malformed. It reports support for OpenFlow 1.4 in the HELLO message, but the rest of the message is not structured as the standard defines.

After contacting all parties, it is clear that nobody will do anything about it, especially HUAWEI, which wants to sell its Agile controller at a high price. It would appear that an old firmware announcing OpenFlow 1.3 was compliant at certification time, but only when talking to software that implements OpenFlow 1.3.0 and nothing newer: starting with 1.3.1, the message is broken there too.

Funnily enough, I recently bought a HUAWEI smartphone that had trouble with smartwatches. The seller told me that most smartwatches work with every phone except Huawei ones, because their Bluetooth implementation is not compliant. It seems to be a habit…

Project 4 deadline extended

Deadline extended to Thursday 13:59

Please take this time to write tests and merge them using the git at https://gitlab.montefiore.ulg.ac.be/INFO0940/project4scripts

Please also try to submit now, even with a barely working sys_pfstat, so that obvious errors can be caught as early as possible.

Fork & Exec system calls to implement a shell

A shell typically parses a command, then forks (duplicates itself).

The child process then replaces itself, using execvp, with the program described by the command.

The parent process waits for the child to exit using waitpid. When that happens, it prints the prompt again, ready for the next command, and the whole thing starts over.

Not a very hard life.
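Putting the three together, a minimal sketch of such a loop (parsing and error handling kept to the bare minimum) could look like this:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    char line[1024];

    for (;;) {
        printf("$ ");                       /* the prompt */
        if (fgets(line, sizeof(line), stdin) == NULL)
            break;                          /* EOF: leave the shell */

        /* Very naive parsing: split the line on whitespace into an argv array. */
        char *argv[64];
        int argc = 0;
        for (char *tok = strtok(line, " \t\n");
             tok != NULL && argc < 63;
             tok = strtok(NULL, " \t\n"))
            argv[argc++] = tok;
        argv[argc] = NULL;
        if (argc == 0)
            continue;                       /* empty line: prompt again */

        pid_t pid = fork();                 /* duplicate the shell */
        if (pid == 0) {
            execvp(argv[0], argv);          /* child: replace itself with the command */
            perror("execvp");               /* only reached if exec failed */
            exit(EXIT_FAILURE);
        } else if (pid > 0) {
            int status;
            waitpid(pid, &status, 0);       /* parent: wait for the child to finish */
        } else {
            perror("fork");
        }
    }
    return 0;
}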

Here are the important parts of the manpages of the 3 functions:

Fork

NAME
fork – create a child process

SYNOPSIS
#include <unistd.h>

pid_t fork(void);

DESCRIPTION
fork() creates a new process by duplicating the calling process. The
new process is referred to as the child process. The calling process
is referred to as the parent process.

The child process and the parent process run in separate memory spaces.
At the time of fork() both memory spaces have the same content.

RETURN VALUE
On success, the PID of the child process is returned in the parent, and
0 is returned in the child. On failure, -1 is returned in the parent,
no child process is created, and errno is set appropriately.

Execvp

NAME
execvp – execute a file

SYNOPSIS
#include <unistd.h>
int execvp(const char *file, char *const argv[]);

DESCRIPTION

The execvp() function replaces the current process image with
a new process image.

The initial argument for this function is the name of a file that is
to be executed.

The execv(), execvp(), and execvpe() functions provide an array of pointers
to null-terminated strings that represent the argument list available to the
new program. The first argument, by convention, should point to the filename
associated with the file being executed. The array of pointers must be
terminated by a null pointer.

RETURN VALUE
The exec() functions return only if an error has occurred. The return
value is -1, and errno is set to indicate the error.

waitpid

NAME
waitpid – wait for process to change state

SYNOPSIS

pid_t waitpid(pid_t pid, int *status, int options);

DESCRIPTION

The waitpid() system call suspends execution of the calling process
until a child specified by pid argument has changed state. By default,
waitpid() waits only for terminated children.

Home-Assistant : live camera feed and motion detection with a USB camera using motion

I want to display my webcam feed in Home Assistant. That’s easy and well explained on Home Assistant’s website. However, they do not explain how to implement a motion detection system at the same time.

First step: set up the camera live feed as explained in the docs.

In your configuration.yaml:

camera:
  - platform: mjpeg
    mjpeg_url: http://localhost:8081
    name: Salon

Install motion:

sudo apt-get install motion

Configure /etc/motion/motion.conf (change these values):

daemon on
stream_port 8081
stream_quality 80
stream_maxrate 12
stream_localhost on

And then restart motion:

sudo service motion restart

Restart Home Assistant too, and the webcam should appear! Yeah!

Now for the motion detection. The method I chose is to use the MQTT protocol: a binary sensor will hold the state of motion detection, motion will publish updates to a given topic to say whether motion is on or off, and Home Assistant will subscribe to it.

Add this to your HA configuration.yaml:

mqtt: # I skip over the MQTT broker setup process here
  broker: 127.0.0.1
  port: 1883
  client_id: home-assistant
  keepalive: 60
  protocol: 3.1

binary_sensor:
  - platform: mqtt
    state_topic: "living_room/cam1"
    name: cam1
    sensor_class: motion

Install mosquitto-clients:

sudo apt-get install mosquitto-clients

The command to signal a motion event is:

mosquitto_pub -r -i motion-cam1 -t "living_room/cam1" -m "ON" 

-r sets the retain flag
-i is just a client id
-t is the topic, which should match the state_topic in the HA configuration
-m sets the message content: ON when motion is detected, OFF when the image is still

Then we have to update motion.conf accordingly:

on_event_start mosquitto_pub -r -i motion-cam1 -t "living_room/cam1" -m "ON"
on_event_end mosquitto_pub -r -i motion-cam1 -t "living_room/cam1" -m "OFF"

Then restart motion, and it’s finished!

A word about mode 3

First, you probably noted that it’s mostly a bonus: start on the mode 3 code only when everything else is finished.

We already saw in the first slides how the RX rings of the e1000 NIC work. By now you should also know the main receive function: e1000_clean_rx_irq.

There is one struct e1000_rx_ring per hardware RX ring. Normally, there is only one ring (and you should only care about one for step 7).

I think the fields are pretty well documented (for once…):

191 struct e1000_rx_ring {
192         /* pointer to the descriptor ring memory */
193         void *desc;
194         /* physical address of the descriptor ring */
195         dma_addr_t dma;
196         /* length of descriptor ring in bytes */
197         unsigned int size;
198         /* number of descriptors in the ring */
199         unsigned int count;
200         /* next descriptor to associate a buffer with */
201         unsigned int next_to_use;
202         /* next descriptor to check for DD status bit */
203         unsigned int next_to_clean;
204         /* array of buffer information structs */
205         struct e1000_rx_buffer *buffer_info;
206         struct sk_buff *rx_skb_top;
207 
208         /* cpu for rx queue */
209         int cpu;
210 
211         u16 rdh;
212         u16 rdt;
213 };

The desc field is a pointer to the memory zone containing the ring. This memory zone is also accessible by the NIC itself through the dma address. The ring is a contiguous array of e1000_rx_desc structures. You usually access it with

E1000_RX_DESC(R, i)

where R is the ring (the driver passes *rx_ring) and i is the index of the descriptor.
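For instance, the receive function walks the ring roughly like this (a simplified sketch, not the exact driver code):

/* Simplified sketch of how e1000_clean_rx_irq walks the RX ring. */
unsigned int i = rx_ring->next_to_clean;
struct e1000_rx_desc *rx_desc = E1000_RX_DESC(*rx_ring, i);

while (rx_desc->status & E1000_RXD_STAT_DD) {   /* "descriptor done": a packet landed here */
        /* ... handle the packet of le16_to_cpu(rx_desc->length) bytes
         *     stored in the buffer attached to slot i ... */

        if (++i == rx_ring->count)              /* the ring wraps around */
                i = 0;
        rx_desc = E1000_RX_DESC(*rx_ring, i);
}
rx_ring->next_to_clean = i;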

One descriptor is composed of:

522 struct e1000_rx_desc {
523         __le64 buffer_addr;     /* Address of the descriptor's data buffer */
524         __le16 length;          /* Length of data DMAed into data buffer */
525         __le16 csum;            /* Packet checksum */
526         u8 status;              /* Descriptor status */
527         u8 errors;              /* Descriptor Errors */
528         __le16 special;
529 };

The buffer_addr field holds the physical memory address where a packet can be received. The kernel normally puts the address of a skbuff->data there. But when a packet is received and its content is copied into that buffer, how do we get back the corresponding skbuff? This is why the e1000_rx_ring structure also has a buffer_info pointer. There are exactly as many buffer_info entries as e1000_rx_desc entries. The buffer info contains all the software-only information that we need to keep about each descriptor, such as the skbuff that will receive the data pointed to by buffer_addr.
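In other words, descriptor i and buffer_info[i] always describe the same slot; schematically (a sketch, not driver code):

/* Sketch: the hardware-visible and software-only views of the same ring slot. */
struct e1000_rx_desc *rx_desc = E1000_RX_DESC(*rx_ring, i);
struct e1000_rx_buffer *buffer_info = &rx_ring->buffer_info[i];

/* The NIC only ever sees rx_desc->buffer_addr (a DMA address);
 * buffer_info keeps the kernel-side view of that same buffer
 * (its dma handle and the data pointer the skb will be built from),
 * so the driver can find its way back once a packet has landed there. */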

Before receiving any packets, all buffer_addr fields have to be set! Otherwise the NIC wouldn’t know where to put the data. This is done in e1000_configure:

407         for (i = 0; i < adapter->num_rx_queues; i++) {
408                 struct e1000_rx_ring *ring = &adapter->rx_ring[i];
409                 adapter->alloc_rx_buf(adapter, ring,
410                                       E1000_DESC_UNUSED(ring));
411         }

alloc_rx_buf is a pointer to the function e1000_alloc_rx_buffers (if you don’t use jumbo frames, and you shouldn’t here). We see that this function is called for all RX rings.

The function e1000_alloc_rx_buffers is defined at line 4561. It calls “e1000_alloc_frag” for each buffer, starting from rx_ring->next_to_use (initialized to 0), for cleaned_count buffers (in e1000_configure, the size of the ring is passed, so this amounts to a full reset).

The memory obtained from e1000_alloc_frag is (probably) not accessible by the hardware as-is, so we have to "DMA map" it:

4613                 buffer_info->dma = dma_map_single(&pdev->dev,
4614                                                   data,
4615                                                   adapter->rx_buffer_len,
4616                                                   DMA_FROM_DEVICE);

You will have to do this with your fastnet buffers!

The resulting dma address is then put into buffer_addr (line 4650).

When some packets are received, e1000_clean_rx_irq will make skbuffs out of them, and we’ll need to allocate new buffers and put pointers to them inside the ring so the next packets can be received. This is done at the end of e1000_clean_rx_irq:

4481         cleaned_count = E1000_DESC_UNUSED(rx_ring);
4482         if (cleaned_count)
4483                 adapter->alloc_rx_buf(adapter, rx_ring, cleaned_count);

So when the device goes into mode 3, you have to (see the sketch after this list):

  • Change the adapter->alloc_rx_buf function to your new one, which will set buffer_addr to DMA-mapped fastnet buffers.
  • Do a full pass of the new alloc_rx_buf to replace all current skb buffers with fastnet buffers, so all new packets will be received directly in the fastnet zone.
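A possible shape for that replacement, as a rough sketch only (fastnet_next_free_buffer is a hypothetical helper; error handling, the buffer_info bookkeeping and the RDT tail update are left out):

/* Hypothetical sketch of a mode-3 replacement for adapter->alloc_rx_buf.
 * fastnet_next_free_buffer() is a made-up helper that should return the
 * kernel virtual address of the next unused buffer of the fastnet zone. */
static void fastnet_alloc_rx_buffers(struct e1000_adapter *adapter,
                                     struct e1000_rx_ring *rx_ring,
                                     int cleaned_count)
{
        struct pci_dev *pdev = adapter->pdev;
        unsigned int i = rx_ring->next_to_use;

        while (cleaned_count--) {
                struct e1000_rx_desc *rx_desc = E1000_RX_DESC(*rx_ring, i);
                struct e1000_rx_buffer *buffer_info = &rx_ring->buffer_info[i];
                void *data = fastnet_next_free_buffer(adapter->netdev); /* hypothetical helper */

                /* Same DMA mapping as the original driver, but on a fastnet buffer. */
                buffer_info->dma = dma_map_single(&pdev->dev, data,
                                                  adapter->rx_buffer_len,
                                                  DMA_FROM_DEVICE);
                /* The NIC will now write the next packet for this slot into the fastnet zone. */
                rx_desc->buffer_addr = cpu_to_le64(buffer_info->dma);

                if (++i == rx_ring->count)
                        i = 0;
        }
        rx_ring->next_to_use = i;
}

Keep in mind that the original e1000_alloc_rx_buffers also remembers the allocated data in buffer_info and writes the new tail to the RDT register so the NIC knows buffers are available; keep the equivalent of both in your version.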

When a packet is received:

  • In fastnet mode 3, do not call any skb-related function in clean_rx_irq: lines 4385 to 4401, which create the skb, line 4424, line 4438, and lines 4453 to 4463.
  • Instead of calling the kernel receive path at line 4464, copy the packet length into the fastnet buffer descriptor (see the sketch below). Note that, as for the skb in buffer_info, you will need to keep a pointer to the related fastnet descriptor. There is no need to call the kernel receive path anymore: the whole purpose of mode 3 is to receive the packet directly in userspace. Just set the length so the user knows there is a packet available inside the corresponding buffer!
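As a rough sketch of that part (struct fastnet_desc and the fastnet_desc pointer in buffer_info are hypothetical names, use whatever your fastnet zone actually defines):

/* Hypothetical sketch of the mode-3 receive handling inside e1000_clean_rx_irq.
 * 'struct fastnet_desc' and buffer_info->fastnet_desc are made-up names. */
struct fastnet_desc {
        u32 length;       /* 0 means "no packet available in this buffer yet" */
        /* ... whatever else userspace needs to know about this buffer ... */
};

/* In place of building an skb and handing it to the kernel receive path: */
struct fastnet_desc *fdesc = buffer_info->fastnet_desc; /* saved when the buffer was mapped */

/* The packet already sits in the fastnet buffer the NIC wrote to; publishing
 * its length is enough for userspace to know a packet is available there. */
fdesc->length = le16_to_cpu(rx_desc->length);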

What I mean by a “generic way” is that as little code as possible has to be written inside e1000, so do not implement any fastnet-descriptor-related code in e1000: create a function in fastnet.c to get the next fastnet buffer that you will map, and use net_dev_ops or adapter-> function pointers so that the fastnet ioctl can call a generic function like dev->ops->set_in_fastnet_zc_mode that any driver may or may not implement.
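To illustrate the shape of such an interface, here is a sketch with entirely hypothetical names (fastnet_ops, set_zc_mode and the fastnet_ops pointer are not existing kernel fields, they are just what your own code could add):

/* Hypothetical sketch: all names below are made up, they only illustrate
 * the split between generic fastnet code and driver-specific code. */

/* In fastnet.c: generic helpers that know nothing about e1000. */
void *fastnet_next_free_buffer(struct net_device *dev);  /* next fastnet buffer to map */

/* Operations a driver may or may not implement. */
struct fastnet_ops {
        int (*set_zc_mode)(struct net_device *dev, bool enable);  /* enter/leave mode 3 */
};

/* The fastnet ioctl only ever goes through the generic pointer
 * (stored e.g. in a field you add to struct net_device, or in the adapter). */
static int fastnet_enable_zc(struct net_device *dev)
{
        if (!dev->fastnet_ops || !dev->fastnet_ops->set_zc_mode)
                return -EOPNOTSUPP;     /* this driver does not support mode 3 */
        return dev->fastnet_ops->set_zc_mode(dev, true);
}

This way, e1000 only provides its set_zc_mode implementation (which swaps alloc_rx_buf and re-fills the ring), while everything fastnet-specific stays in fastnet.c.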