A word about mode 3

First, you probably noted that mode 3 is mostly a bonus : start the mode 3 code only when everything else is finished.

We already saw in the first slides how the rx_rings of the e1000 NIC work. Normally you also know the main receive function : e1000_clean_rx_irq.

There is one struct e1000_rx_ring per hardware RX ring. Normally, there is only one ring (and you should only care about one for step 7).

I think the fields are pretty well defined (for once…) :

struct e1000_rx_ring {
        /* pointer to the descriptor ring memory */
        void *desc;
        /* physical address of the descriptor ring */
        dma_addr_t dma;
        /* length of descriptor ring in bytes */
        unsigned int size;
        /* number of descriptors in the ring */
        unsigned int count;
        /* next descriptor to associate a buffer with */
        unsigned int next_to_use;
        /* next descriptor to check for DD status bit */
        unsigned int next_to_clean;
        /* array of buffer information structs */
        struct e1000_rx_buffer *buffer_info;
        struct sk_buff *rx_skb_top;
        /* cpu for rx queue */
        int cpu;
        u16 rdh;
        u16 rdt;
};

desc is a pointer to the memory zone containing the ring. The NIC itself accesses this memory zone through the dma address. The ring is a contiguous array of e1000_rx_desc structures. Usually you access it with

E1000_RX_DESC(R, i)

where R is the ring pointer and i the index of the descriptor.

One descriptor is composed of :

struct e1000_rx_desc {
        __le64 buffer_addr;     /* Address of the descriptor's data buffer */
        __le16 length;          /* Length of data DMAed into data buffer */
        __le16 csum;            /* Packet checksum */
        u8 status;              /* Descriptor status */
        u8 errors;              /* Descriptor Errors */
        __le16 special;
};

The buffer_addr variable holds the physical memory address where a packet can be received. Normally the kernel puts the address of an skbuff->data there. But when a packet is received and its content is copied into that buffer, how do we get back to the corresponding skbuff? This is why the e1000_rx_ring structure also has a buffer_info pointer. There are exactly as many buffer_info entries as e1000_rx_desc descriptors. The buffer_info contains all the software-only information that we need to keep about each descriptor, such as the skbuff which will receive the data pointed to by buffer_addr.
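To make the "two parallel arrays" idea concrete, here is a small userspace model. It is only a sketch : all the model_* names are made up for the illustration and are not the driver's, and plain integers stand in for real DMA addresses. The point is just that descriptor i and buffer_info i always describe the same buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_RING_COUNT 8

/* What the hardware sees (same layout idea as e1000_rx_desc). */
struct model_rx_desc {
    uint64_t buffer_addr;
    uint16_t length;
    uint16_t csum;
    uint8_t status;
    uint8_t errors;
    uint16_t special;
};

/* Software-only information, one entry per descriptor. */
struct model_rx_buffer {
    void *owner;    /* stands in for the skbuff (or a fastnet descriptor) */
    uint64_t dma;   /* stands in for the DMA address of the buffer */
};

struct model_rx_ring {
    struct model_rx_desc desc[MODEL_RING_COUNT];
    struct model_rx_buffer buffer_info[MODEL_RING_COUNT];
    unsigned int count;
};

/* Same spirit as E1000_RX_DESC(R, i) : descriptor i of ring R. */
#define MODEL_RX_DESC(R, i) (&(R)->desc[i])

/* Give slot i a buffer : the address goes to the hardware side,
 * the owner pointer stays on the software side. */
static void model_attach(struct model_rx_ring *ring, unsigned int i,
                         void *owner, uint64_t dma)
{
    assert(i < ring->count);
    ring->buffer_info[i].owner = owner;
    ring->buffer_info[i].dma = dma;
    MODEL_RX_DESC(ring, i)->buffer_addr = dma;
}

/* The hardware completed descriptor i : the same index in buffer_info
 * tells us which software object the data belongs to. */
static void *model_owner_of(struct model_rx_ring *ring, unsigned int i)
{
    assert(i < ring->count);
    return ring->buffer_info[i].owner;
}
```

In mode 3 the same trick applies : the "owner" you keep next to each descriptor would be your fastnet descriptor instead of an skbuff.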

Before receiving any packet, every buffer_addr has to be set ! Otherwise the NIC wouldn’t know where to put the data. This is done in e1000_configure :

        for (i = 0; i < adapter->num_rx_queues; i++) {
                struct e1000_rx_ring *ring = &adapter->rx_ring[i];
                adapter->alloc_rx_buf(adapter, ring,
                                      E1000_DESC_UNUSED(ring));
        }

alloc_rx_buf is a pointer to the function e1000_alloc_rx_buffers (if you don’t use jumbo frames, and you shouldn’t here). We see that this function is called for every rx ring.

The function e1000_alloc_rx_buffers is defined at line 4561. It calls “e1000_alloc_frag” for cleaned_count buffers, starting at rx_ring->next_to_use (initialized to 0). In e1000_configure the number of unused descriptors is passed, which on a fresh ring covers the whole ring except one slot, so this is equivalent to a full reset.

The memory obtained from e1000_alloc_frag is (probably) not accessible by hardware, so we have to “DMA map” it :

4613                 buffer_info->dma = dma_map_single(&pdev->dev,
4614                                                   data,
4615                                                   adapter->rx_buffer_len,
4616                                                   DMA_FROM_DEVICE);

You will have to do this with your fastnet buffers !

Then the resulting dma address is put inside buffer_addr (line 4650).
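The refill logic can be sketched in userspace as follows. This is a model under assumptions, not the driver code : the model_* names are invented, model_dma_map just returns a fake address where the driver would call dma_map_single, and MODEL_DESC_UNUSED is written to mirror the arithmetic of E1000_DESC_UNUSED as I read it in e1000.h.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_COUNT 8

struct model_ring {
    uint64_t buffer_addr[MODEL_COUNT];  /* one per descriptor */
    unsigned int count;
    unsigned int next_to_use;    /* next descriptor to give a buffer to */
    unsigned int next_to_clean;  /* next descriptor to check for a packet */
};

/* Number of descriptors that can be refilled (one slot always stays empty). */
#define MODEL_DESC_UNUSED(R) \
    ((((R)->next_to_clean > (R)->next_to_use) ? 0 : (R)->count) + \
     (R)->next_to_clean - (R)->next_to_use - 1)

/* Fake mapping : the real code would allocate a buffer and DMA map it. */
static uint64_t model_dma_map(unsigned int slot)
{
    return 0x10000u + 2048u * slot;
}

/* Give a buffer to cleaned_count descriptors, starting at next_to_use
 * and wrapping around at the end of the ring. */
static void model_alloc_rx_buffers(struct model_ring *ring, int cleaned_count)
{
    unsigned int i = ring->next_to_use;

    while (cleaned_count-- > 0) {
        ring->buffer_addr[i] = model_dma_map(i);
        if (++i == ring->count)
            i = 0;
    }
    ring->next_to_use = i;
}
```

On a fresh ring of 8 descriptors the first call fills 7 of them, and after the receive path has cleaned 3 descriptors, a call with MODEL_DESC_UNUSED refills exactly those 3.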

When some packets are received, e1000_clean_rx_irq will make skbuffs out of them, and we’ll need to allocate new buffers and put pointers to them inside the ring so that the next packets can be received. This is done at the end of e1000_clean_rx_irq :

4481         cleaned_count = E1000_DESC_UNUSED(rx_ring);
4482         if (cleaned_count)
4483                 adapter->alloc_rx_buf(adapter, rx_ring, cleaned_count);

So when the device goes into mode 3 you have to :

  • Replace the adapter->alloc_rx_buf function with your new one, which will set buffer_addr to DMA-mapped fastnet buffers.
  • Do a full pass of the new alloc_rx_buf to replace all current skb buffers with fastnet buffers, so that all new packets will be received directly in the fastnet zone.

When a packet is received :

  • In fastnet mode 3, do not call any skb-related function in clean_rx_irq : lines 4385 to 4401 which create the skb, line 4424, line 4438, and lines 4453 to 4463.
  • Instead of calling the kernel receive path at line 4464, copy the packet length to the fastnet buffer descriptor. Note that, as for the skb in buffer_info, you will need to keep a pointer to the related fastnet descriptor. There is no need to call the kernel receive path anymore : the whole purpose of mode 3 is to receive the packet directly in userspace. Just set the length so the user knows there is a packet available inside the corresponding buffer !
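The two lists above can be sketched as a userspace model. Everything here is hypothetical (the model_* and fastnet_desc_model names, the layout of a fastnet descriptor, the fake addresses) : it only shows the shape of the change, that is a function pointer swapped when entering mode 3, a full refill pass, and a receive path which only writes the length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_COUNT 4

/* Hypothetical fastnet buffer descriptor shared with userspace. */
struct fastnet_desc_model {
    uint16_t length;        /* 0 means "no packet yet" */
    uint8_t data[2048];
};

struct model_buffer_info {
    void *skb;                          /* used in normal mode */
    struct fastnet_desc_model *fdesc;   /* used in mode 3 */
};

struct model_adapter;
typedef void (*model_alloc_fn)(struct model_adapter *, unsigned int);

struct model_adapter {
    uint64_t buffer_addr[MODEL_COUNT];
    struct model_buffer_info buffer_info[MODEL_COUNT];
    model_alloc_fn alloc_rx_buf;
    struct fastnet_desc_model *zone;
};

static int model_fake_skb;  /* placeholder for a real skbuff */

/* Normal mode : every slot points to skb memory. */
static void model_alloc_skb(struct model_adapter *a, unsigned int n)
{
    for (unsigned int i = 0; i < n && i < MODEL_COUNT; i++) {
        a->buffer_info[i].skb = &model_fake_skb;
        a->buffer_info[i].fdesc = NULL;
        a->buffer_addr[i] = 0x1000u + i;    /* fake DMA address */
    }
}

/* Mode 3 : every slot points into the fastnet zone. */
static void model_alloc_fastnet(struct model_adapter *a, unsigned int n)
{
    for (unsigned int i = 0; i < n && i < MODEL_COUNT; i++) {
        a->buffer_info[i].skb = NULL;
        a->buffer_info[i].fdesc = &a->zone[i];
        /* the driver would store the dma_map_single() result here */
        a->buffer_addr[i] = (uint64_t)(uintptr_t)a->zone[i].data;
    }
}

/* Entering mode 3 : swap the allocator, then do one full pass. */
static void model_enter_mode3(struct model_adapter *a,
                              struct fastnet_desc_model *zone)
{
    a->zone = zone;
    a->alloc_rx_buf = model_alloc_fastnet;
    a->alloc_rx_buf(a, MODEL_COUNT);
}

/* Returns 1 if the packet was delivered to the fastnet zone,
 * 0 if it would follow the normal kernel receive path. */
static int model_receive(struct model_adapter *a, unsigned int i,
                         uint16_t length)
{
    if (a->buffer_info[i].fdesc) {
        a->buffer_info[i].fdesc->length = length;
        return 1;
    }
    return 0;
}
```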

What I mean by a “generic way” is that as little code as possible should be written inside e1000. So do not implement any fastnet descriptor related code in e1000 : create a function in fastnet.c to get the next fastnet buffer that you will map, and use net_dev_ops or adapter-> functions so that the fastnet ioctl can call a generic function like dev->ops->set_in_fastnet_zc_mode that any driver may or may not implement.
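A possible shape for that generic hook, as a userspace sketch : the structures and the model_* names are invented for the example, and returning -EOPNOTSUPP when a driver does not provide the hook is my assumption, not a requirement of the assignment.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct model_dev;

/* Optional operations a driver may provide. */
struct model_dev_ops {
    int (*set_in_fastnet_zc_mode)(struct model_dev *dev, int enable);
};

struct model_dev {
    const struct model_dev_ops *ops;
    int zc_enabled;
};

/* What a driver like e1000 would implement on its side. */
static int model_e1000_set_zc(struct model_dev *dev, int enable)
{
    dev->zc_enabled = enable;
    return 0;
}

static const struct model_dev_ops model_e1000_ops = {
    .set_in_fastnet_zc_mode = model_e1000_set_zc,
};

/* A driver which knows nothing about fastnet. */
static const struct model_dev_ops model_other_ops = {
    .set_in_fastnet_zc_mode = NULL,
};

/* Generic code (think fastnet.c) : no driver-specific knowledge here. */
static int model_fastnet_set_mode3(struct model_dev *dev)
{
    if (!dev->ops || !dev->ops->set_in_fastnet_zc_mode)
        return -EOPNOTSUPP;
    return dev->ops->set_in_fastnet_zc_mode(dev, 1);
}
```

The generic side never mentions e1000 : a device whose driver does not fill in the pointer simply refuses mode 3.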


Profiling of Socket, PCAP and Fastnet to receive RAW packets in userspace

The source code I presented is available at :


I’m willing to merge any pull request (called merge request on gitlab, so that they don’t copy github too much^^). Especially for parsing arguments, having options like “-b” to set the buffer size instead of recompiling, …

Since the presentation I added a “do_something” function called for each received packet in each method. It simply reads bytes 12 and 13 of the ethernet header, checks whether it’s an IP packet, and counts the IP packets. As I said in class, this forces the benchmark to actually read the content of the packet, which makes it much more realistic, as nobody receives raw packets in userspace to do nothing with them… So you will hit memory for each packet. For nearly all methods the content has just been memcopied to userspace, but in mode 3 the NIC writes the packet directly to the buffer, so when you access the content you’ll lose ~300 CPU cycles just waiting for the packet content to be brought into the cache. It would therefore be unfair to only get the packet length and not the data.
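For reference, such a function can be as small as the sketch below. The signature, the counter name and the MODEL_* constants are assumptions made for the example ; the version in the repository may look different.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_ETH_HLEN 14       /* 6 + 6 bytes of MAC addresses + 2 bytes of type */
#define MODEL_ETH_P_IP 0x0800   /* same value as ETH_P_IP */

static unsigned long ip_packets;

/* Touch the packet content : read the ethernet type and count IP packets. */
static void do_something(const uint8_t *frame, size_t len)
{
    uint16_t type;

    if (len < MODEL_ETH_HLEN)
        return;     /* too short to hold an ethernet header */

    /* Bytes 12 and 13 hold the type, most significant byte first. */
    type = (uint16_t)((frame[12] << 8) | frame[13]);
    if (type == MODEL_ETH_P_IP)
        ip_packets++;
}
```

Building the value byte by byte avoids any dependency on the byte order of the CPU.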

I also added a “socket” method, using the standard Linux socket, which works like the fastnet read function. As you’ll probably find out, PCAP already has a big advantage, as it uses a special feature of the kernel to receive packets much like what we do in Fastnet Step 7 : packet_mmap. I imagine you’ll be happy to see that you, humble student, can already do better than packet_mmap, which is the best that Linux can offer to receive raw packets. Those doing the implementation for mode 3 will be able to see how much faster we can go.

If people are interested in changing mode 3 to make it work with packet_mmap and in trying to submit a patch to the Linux networking team, they can contact me and we’ll do it together. It’s time to make Linux move regarding fast packet capture, and if our patch is not accepted, it will at least piss off some people and make Linux move in the right direction… I know you have a “personal project” course in master 1 that can surely credit this task.

Some of the commands I use in class :

sudo ip link add veth0 type veth peer name veth1

sudo ifconfig eth2 up

sudo tcpreplay -l 0 -q -i vboxnet0 --preload packet

sudo tcpreplay -l 0 --mbps=100.0 -q -i vboxnet0 --preload packet

The 64-byte UDP packet I made is available at https://www.tombarbette.be/packet

If you use a bridge, update the packet to set its source and destination mac address to the mac address of the bridge on your host and the mac address of the ethernet port on your virtual machine.

Step 7 : Go back to mode 0 on file close

Step 7 will be easier if you go back to mode 0 when the file descriptor is closed. This happens when the user calls close(fd), or automatically when the application exits. So there will never be a device in fastnet mode without an open file descriptor. It also means that module_exit has nothing to do : if you set .owner = THIS_MODULE correctly, the module cannot be removed while there are still open file descriptors, so when the module is removed you know that there is no device in fastnet mode.

This is closer to how a file should be handled, but it mimics the syscall less, as there is nothing “opened” after a syscall : there is no state, no “session” and therefore no “close”.

This also avoids the need to keep a list of the devices currently in fastnet mode, which makes things easier. A lot of groups used the first_net_device facility, but that won’t iterate over a device which is being removed but is blocked by the dev_hold/dev_put reference. So the exit function was a little messy, and this should make it easier… Multiple groups also called their “free_list_in_buffer” function on all devices given by first_net_device. If the rx_handler (and therefore the rx_handler_data) wasn’t yours, you’ll follow an unknown pointer and free things randomly… Bad ! In the close function, you know it’s your device and you know its current mode, so it’s easy. The specification that there is only one file descriptor per device, and one device per file descriptor, still holds, so the close function is pretty straightforward.

The assignment has been updated accordingly.

Update about Step 6

I updated the Step #6 with more specifications as I received quite a lot of questions. Mostly :

  • The buffers are indeed per-device : if we have two open()+ioctl()+read() sequences on two different devices, each read() will return packets from the device previously set by the ioctl(), not randomly from any of the devices.
  • I added the “user” constraint that one file descriptor will only receive ioctls about a single device. Each open operation creates a new and independent file descriptor. You can think of /dev/fastnet as an “entry point”, but each open() in fact gives a very different file, where the content of read() depends on the previous ioctls.

You can use the private_data field of the struct file *filp to remember things, like the device passed to the ioctl, until the read.

If you have already gone in a too different direction because you misunderstood something (or I wasn’t clear enough…) and don’t want to go back, contact me.

3 reasons why it is better to use rx_handler_data than adding fastnet_mode

Yes, it’s not in my corrected code on gitlab, but well, do what I say, not what I do.

– Modifying netdevice.h forces you to rebuild and reinstall all modules that include it. If you don’t, you’ll get errors on boot or when loading a module built against the old header.
– The “function pointer + private pointer” method is found in a lot of places in the kernel. Usually, you want to do some stuff (a function) when something happens. That is called a callback. But usually you need some information, some context, with it. When you ask the disk for some data, and you finally get back the block (after an IRQ), having just the block you asked for is not enough… You’ll need to remember what you wanted to do with that block. You’ll find “private” structures in a lot of places. But pay attention : is it private for you, or private for someone else? In the case of the rx_handler, it’s pretty obvious as rx_handler_data (the private pointer) is set via netdev_rx_handler_register : the owner is the one who registered the handler. But you also have a private space in the netdevice for example (accessed through netdev_priv). For whom is it private? Its intended user is the device owner, that is the driver. If you modify it, you will corrupt the driver… So if you want to retain some per-device data but you’re not the creator of this net_device struct, this is absolutely not the way to go…
– If you add a variable, you have to initialize it, which means adding yet more code.
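The callback + private pointer idea from the second point fits in a few lines of plain C. This is a userspace sketch with invented model_* names ; refusing a second registration with -EBUSY is modelled on what I expect from netdev_rx_handler_register, so treat it as an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef int (*model_rx_handler_t)(int packet_len, void *priv);

/* The "device" only stores the callback and an opaque pointer. */
struct model_netdev {
    model_rx_handler_t rx_handler;
    void *rx_handler_data;
};

/* The context belongs to whoever registered the handler. */
struct model_fastnet_state {
    int mode;
    unsigned long packets;
};

static int model_register(struct model_netdev *dev,
                          model_rx_handler_t fn, void *priv)
{
    if (dev->rx_handler)
        return -EBUSY;  /* somebody else already owns the handler */
    dev->rx_handler_data = priv;
    dev->rx_handler = fn;
    return 0;
}

/* The callback gets its context back through the private pointer. */
static int model_fastnet_rx(int packet_len, void *priv)
{
    struct model_fastnet_state *st = priv;

    (void)packet_len;
    st->packets++;
    return st->mode;
}

/* The device side : it never looks inside the private pointer. */
static int model_deliver(struct model_netdev *dev, int packet_len)
{
    if (!dev->rx_handler)
        return -1;
    return dev->rx_handler(packet_len, dev->rx_handler_data);
}
```

The device code never needs a fastnet-specific field : the state travels with the registration.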

I didn’t formalize this in steps 4 and 5, because I hadn’t tried it… And because the kernel developers are really bastards sometimes (http://lxr.free-electrons.com/source/include/linux/netdevice.h?v=4.1#L1428 seriously? It would be faster to describe it than to write that…).

For Step 6, it’s more debatable whether you need to add a fastnet field or not, as you also access a buffer of skbuffs from the file operations. I didn’t specify what to do when the ioctl is called to go back to mode 0 while there are still packets to read… What must not happen is a memory leak.

Note the __rcu annotation before the rx_handler_data definition in netdevice.h… Some of you forgot to use the RCU-specific accessors for dereferencing. I didn’t penalize the usage of RCU either way, but for Step 6, if you use an RCU pointer, you’ll have to do it right. Note that you shouldn’t play with RCU in your fastnet syscall/ioctl, as netdev_rx_handler_register does it for you. Same for netdev_rx_handler_unregister. What you need to do is use rcu_dereference in your rx_handler code to access the private data.

So next time, use RCU correctly, and try to use the various private data pointers when you register things. Files also have some space for their backing file_operations 😉

consume_skb vs kfree_skb

Looking at the documentation of the function (or simply online at https://kernel.org/doc/htmldocs/networking/API-consume-skb.html) you’ll find that kfree_skb is intended to be used on error.

In our rx_handler, it’s a feature, not a bug. So consume_skb is better. If you used kfree_skb, it was okay regarding the marking for Step 4.

Be warned that my last test will make a transfer of a big file through a mode 1 interface… If you forgot the free, it will fill the memory !


Step #5 update and Step #4 correction

Please note that the deadline for the Step 5 correction has been changed to Tuesday. I also added a note about the ioctl number : you should define it correctly, but do not check it inside your ioctl, as I cannot automatically find the number you’ll use. I will personally use 1 as the command argument when calling your ioctl. So just ignore the command argument in your ioctl. Remember that the mode/dev is given as a structure pointer in the third argument. The command number is normally used to provide multiple functionalities through one function, with a kind of big switch/case inside the ioctl. We only have one : “set fastnet mode”. So it’s not a big deal.

I added a “fake” project on the platform, a clone of Step 4, to allow you to correct your Step 4 code if you want and test it on the platform by re-submitting as many times as you want. I also pushed my code to Gitlab ; if you find bugs or problems, tell me ! As the project is slowly getting bigger, you may catch things I forgot… I remember I said something in class that I forgot in my code, but I don’t remember what it was…

The Step 5 will come on the platform ASAP…

Remember that you have to completely remove the system call and any definition made for it. But keep the messages from step 2, the credits from step 1, …

A reminder about packets in networks

An ethernet packet is made in layers. Each layer is inside another layer.

The content of the skbuff created in e1000_clean_rx_irq is the whole ethernet frame. The Ethernet frame starts with an ethernet header, and then contains the ethernet payload.

You do not receive the preamble (it’s just data to mark the beginning of the frame, always the same so useless to copy) and usually not the CRC either as all NICs can check it is correct for you (it will be removed in e1000_main.c:4451 if it’s there.)

Then how do we know what’s in the data? Well, it is given by the two-byte type field of the ethernet header, at bytes 12 and 13 (counting from 0), right after the two 6-byte MAC addresses.

The known types are defined in if_ether.h, for example you can find the type of the IP packets there :

#define ETH_P_IP 0x0800 /* Internet Protocol packet */

So you know that somewhere, the “thing” handling the packets of the IP protocol will check that the type equals ETH_P_IP. It will in fact check whether the type is cpu_to_be16(ETH_P_IP), because on the network bytes are big-endian, while our (x86) CPU is little-endian. It means that on the wire 0x0800 is sent as the byte 0x08 followed by the byte 0x00, and a little-endian CPU reading those two bytes as a 16-bit integer sees 0x0008. There are a lot of packet types, not just IP. Do not expect to find an “if (type == cpu_to_be16(ETH_P_IP))”… The kernel uses a list of structures describing the known packet types and checks the whole list against the actual packet type.
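Here is a userspace sketch of that "list of known packet types" idea. The model_* names, the fixed-size array and the return values are invented for the example ; htons() plays the role that cpu_to_be16() plays in the kernel.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_MAX_PTYPES 4

struct model_packet_type {
    uint16_t type;                      /* stored in network byte order */
    int (*func)(const uint8_t *payload);
};

static struct model_packet_type model_ptypes[MODEL_MAX_PTYPES];
static size_t model_ptype_count;

static int model_ip_rcv(const uint8_t *payload) { (void)payload; return 4; }
static int model_arp_rcv(const uint8_t *payload) { (void)payload; return 6; }

/* Register a handler : the type is converted once, at registration. */
static void model_add_pack(uint16_t host_type,
                           int (*func)(const uint8_t *payload))
{
    assert(model_ptype_count < MODEL_MAX_PTYPES);
    model_ptypes[model_ptype_count].type = htons(host_type);
    model_ptypes[model_ptype_count].func = func;
    model_ptype_count++;
}

/* Walk the whole list and compare raw wire bytes against stored types. */
static int model_deliver(const uint8_t *frame)
{
    uint16_t type;

    memcpy(&type, frame + 12, sizeof(type));    /* still big-endian */
    for (size_t i = 0; i < model_ptype_count; i++)
        if (model_ptypes[i].type == type)
            return model_ptypes[i].func(frame + 14);
    return -1;  /* nobody registered for this type */
}
```

Since both sides of the comparison are in network byte order, no conversion is needed per packet.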

The kernel will call the handler defined for the specific protocol matching the IP packet.

The same scheme applies more or less for every layer, as the IP packet itself is also composed of a header with a type, and some data. And inside it there is again a UDP, TCP, ICMP, … packet with again a header and some data.

For example the type of the payload of the IP packet is given as a single byte (so no byte order problem here), at byte offset 9 of the IP header (counting from 0).

As there are only 256 possible types, the Linux kernel uses a table and not a list, as it allows jumping directly to the right “sub” IP layer handler, such as the one for UDP, TCP, …

IP protocols are defined in in.h. For example we have IPPROTO_UDP = 17, /* User Datagram Protocol */ at line 40, which tells us that 17 is the UDP protocol. Again, a quick search for who uses IPPROTO_UDP in the /net/ipv4 folder will tell us who defines some kind of handler for packets with the IP protocol set to 17. A hint : there is a function which sets the right index of the “protocol table” to the structure containing information about the protocol. So it’s not like the ethernet layer, where the list contains the type and the function : here the protocol number is not in the handler structure 😉
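The difference with the ethernet list can be sketched like this in userspace. All the model_* names are invented (this is not the kernel function the hint is about) : the protocol number is only used as the index of a 256-entry table, and the stored handler does not carry it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int (*model_proto_handler)(const uint8_t *payload);

/* One slot per possible value of the protocol byte. */
static model_proto_handler model_protos[256];

/* The protocol number chooses the slot, it is not stored in the entry. */
static int model_add_protocol(model_proto_handler handler, uint8_t protocol)
{
    if (model_protos[protocol])
        return -1;  /* slot already taken */
    model_protos[protocol] = handler;
    return 0;
}

static int model_udp_rcv(const uint8_t *payload) { (void)payload; return 17; }
static int model_tcp_rcv(const uint8_t *payload) { (void)payload; return 6; }

/* Direct jump : no list walk, just one table lookup. */
static int model_ip_deliver(const uint8_t *ip_header)
{
    /* The low 4 bits of byte 0 give the header length in 32-bit words. */
    size_t header_len = (size_t)(ip_header[0] & 0x0f) * 4;
    model_proto_handler handler = model_protos[ip_header[9]];

    if (!handler)
        return -1;  /* no handler for this protocol */
    return handler(ip_header + header_len);
}
```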