pub fn buildPacket() []const u8 {
    // Define the packet components
    const dest_mac: [6]u8 = [_]u8{ 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF }; // Destination MAC
    const src_mac: [6]u8 = [_]u8{ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 }; // Source MAC
    const ethertype: [2]u8 = [_]u8{ 0x08, 0x00 }; // EtherType (IPv4)
    const payload: [46]u8 = [_]u8{0} ** 46; // Payload (46 bytes of zeroes)
    // Combine all components into a single array
    const packet: [60]u8 = [_]u8{
        // Destination MAC
        dest_mac[0], dest_mac[1], dest_mac[2], dest_mac[3], dest_mac[4], dest_mac[5],
        // Source MAC
        src_mac[0], src_mac[1], src_mac[2], src_mac[3], src_mac[4], src_mac[5],
        // EtherType
        ethertype[0], ethertype[1],
        // Payload
        payload[0], payload[1], payload[2], payload[3],
        payload[4], payload[5], payload[6], payload[7], payload[8], payload[9],
        payload[10], payload[11], payload[12], payload[13], payload[14], payload[15],
        payload[16], payload[17], payload[18], payload[19], payload[20], payload[21],
        payload[22], payload[23], payload[24], payload[25], payload[26], payload[27],
        payload[28], payload[29], payload[30], payload[31], payload[32], payload[33],
        payload[34], payload[35], payload[36], payload[37], payload[38], payload[39],
        payload[40], payload[41], payload[42], payload[43], payload[44], payload[45],
    };
    // Return as a slice
    return packet[0..];
}
It says "transmit at maximum speed", so ok...let's do that. As I continue working on the prospective Zig API, we have enough here to transmit at maximum speed. The trickier part is dealing with all of those pointers and memory addresses. Without going into every line of code, the main point of using the interface is grabbing hold of a memory address: the speed netmap runs at comes from registering the interface and then mapping its shared memory straight into our process with mmap:
const ret: i32 = NetmapManager_ioctl(self, netmap.NIOCREGIF);
if (ret == -1) {
std.debug.print("Error: {s}\n", .{"registration failed"});
return;
}
self._memaddr = netmap.mmap(null, self.nmreq.nr_memsize, netmap.PROT_WRITE | netmap.PROT_READ, netmap.MAP_SHARED, self._fd, 0);
if (self._memaddr == netmap.MAP_FAILED) {
self._memaddr = null;
std.debug.print("Error: {s}\n", .{"mmap failed"});
_ = netmap.close(self._fd);
self._fd = -1;
return;
}
Zig conveniently allows us to combine all of our supporting headers and the netmap headers together, so for now we can refer to them collectively as "netmap".
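To give an idea of what that looks like, here is a rough sketch of the combined import. The exact headers and defines listed are my best guess at what this code needs, not the project's actual import block:
// Sketch of the combined C import; the header list and the NETMAP_WITH_LIBS
// define are assumptions about this project's setup, not the real file.
const netmap = @cImport({
    @cDefine("NETMAP_WITH_LIBS", {}); // pull in the netmap_user.h helpers
    @cInclude("net/netmap.h");
    @cInclude("net/netmap_user.h");
    @cInclude("sys/ioctl.h"); // ioctl()
    @cInclude("sys/mman.h"); // mmap, PROT_*, MAP_*
    @cInclude("unistd.h"); // close()
});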
A netmap interface is located by taking this address and applying an offset to it (cast to a usize). This pointer arithmetic is what is actually driving the Python library behind the scenes:
pub fn NetmapInterface_build(self: *NetmapManager) i32 {
const tempaddr: [*c]u8 = std.zig.c_translation.cast([*c]u8, self._memaddr);
// Check if the memory contains any data (assuming a zero-terminated string or specific marker).
if (tempaddr[0] == 0) {
std.debug.print("Pointer is valid, but the first byte is zero.\n", .{});
} else {
std.debug.print("Pointer points to: {s}\n", .{tempaddr[0..20]}); // Print up to 10 bytes as a slice
}
std.debug.print("nmreq offset: {}\n", .{self.nmreq.nr_offset});
const addr: [*c]u8 = std.zig.c_translation.cast([*c]u8, self._memaddr) + self.nmreq.nr_offset;
const iaddr: *netmap.netmap_if = @alignCast(@ptrCast(addr));
self.memory.interface = .{ ._nifp = iaddr };
if (self.memory.interface.?._nifp == null) {
std.debug.print("Error: {s}\n", .{"Netmap Interface null"});
return -1; // Or handle the null case appropriately
}
return 0;
}
There is a lot I could go into about slots, rings, and flags, but the important part comes down to building the transmit ring in order to process packets. The packets are copied via memcpy into their respective buffers and then the ring is advanced. This pushes the packets over to the receive side of things.
var nm: NetmapManager = NetmapManager_new();
const nm_init = NetmapManager_init(&nm, "/dev/netmap");
try NetmapManager_open(&nm);
nm.setIfName("rng1");
//const nr = NetmapManager_regif(&nm);
try NetmapManager_register(&nm);
const rng: NetmapRing = nm.memory.transmit_rings.?.items[0];
var idx: u32 = 0;
const ring = rng._ring.?;
for (rng.slots.?.items) |slot| {
const msg: []const u8 = buildPacket();
std.debug.print("buf idx: {}\n", .{ring.slot()[idx].buf_idx});
const idxi: usize = @intCast(ring.slot()[idx].buf_idx);
const ofs: usize = @intCast(ring.*.buf_ofs);
const bfs: usize = @intCast(ring.*.nr_buf_size);
const urngptr: [*]u8 = @alignCast(@ptrCast(ring));
const bfaddr = urngptr + ofs + (idxi * bfs);
const bf: [*c]u8 = @alignCast(@ptrCast(bfaddr));
const len: u16 = @truncate(msg.len);
std.crypto.secureZero(
u8,
bf[0..msg.len],
);
@memcpy(bf[0..msg.len], msg);
slot._slot.?.len = len;
std.debug.print("C String: {X}\n", .{bf[0..msg.len]});
idx += 1;
}
//for (0..1000000) |_| {
var cur: i32 = @intCast(ring.*.cur);
while (true) {
const tl: i32 = @intCast(ring.*.tail);
var n: i32 = @intCast(tl - cur);
const nslots: i32 = @intCast(ring.*.num_slots);
if (n < 0) {
n += nslots;
}
if (n > 256) {
n = 256;
}
cur += n;
if (cur >= nslots) {
cur -= nslots;
}
const ncur: u32 = @intCast(cur);
ring.*.cur = ncur;
ring.*.head = ncur;
_ = NetmapManager_ioctl(&nm, netmap.NIOCTXSYNC);
}
Sync is then triggered once the head and cur pointers are moved. This API is still very new, so I will need to spend some time working out more issues, but for now this will do. All we need to know at this point is that a ring has been constructed within the API and its buffers are being filled.
Ok, we are now ready to transfer some packets. These are the results on the receiving end:
004.942100 main_thread [2781] 6.477 Mpps (6.484 Mpkts 106.026 Gbps in 1000967 usec) 256.00 avg_batch 768 min_space
005.943131 main_thread [2781] 6.261 Mpps (6.268 Mpkts 102.490 Gbps in 1001031 usec) 256.00 avg_batch 768 min_space
006.944116 main_thread [2781] 6.242 Mpps (6.248 Mpkts 102.169 Gbps in 1000985 usec) 256.00 avg_batch 768 min_space
007.945121 main_thread [2781] 6.458 Mpps (6.464 Mpkts 105.700 Gbps in 1001005 usec) 256.00 avg_batch 768 min_space
008.946116 main_thread [2781] 6.451 Mpps (6.457 Mpkts 105.592 Gbps in 1000995 usec) 256.00 avg_batch 768 min_space
009.947121 main_thread [2781] 6.486 Mpps (6.493 Mpkts 106.173 Gbps in 1001005 usec) 256.02 avg_batch 1 min_space
010.948119 main_thread [2781] 6.498 Mpps (6.504 Mpkts 106.358 Gbps in 1000998 usec) 256.00 avg_batch 768 min_space
011.949118 main_thread [2781] 6.476 Mpps (6.482 Mpkts 105.997 Gbps in 1000999 usec) 256.04 avg_batch 512 min_space
012.950101 main_thread [2781] 6.550 Mpps (6.556 Mpkts 107.209 Gbps in 1000983 usec) 256.00 avg_batch 768 min_space
013.951109 main_thread [2781] 6.169 Mpps (6.175 Mpkts 100.973 Gbps in 1001007 usec) 256.04 avg_batch 1 min_space
014.952115 main_thread [2781] 5.327 Mpps (5.333 Mpkts 87.201 Gbps in 1001006 usec) 256.06 avg_batch 512 min_space
015.953115 main_thread [2781] 6.106 Mpps (6.112 Mpkts 99.948 Gbps in 1001001 usec) 256.05 avg_batch 1 min_space
So, this reaches over 100Gbps currently, and we adhere to the average batch size as well. This seems much faster than the Python version. Speeds will go down as things become more safe, but I couldn't help but share these amazing initial results. By implementing this in Python and C at the same time we have a broad view of what we want in the end. I am still battling things like ArrayLists as I continue to learn Zig, but so far it has had enough functionality to get this job done.
Since departing from the Python code I also realized we could reuse a lemma for a fully safe initialization routine, so that will get added in as well. Grabbing hold of the memory addresses can be tricky. As an example of what it takes to initialize a ring, here is the macro from the netmap headers:
#define NETMAP_RXRING(nifp, index) \
_NETMAP_OFFSET(struct netmap_ring *, nifp, \
(nifp)->ring_ofs[index + (nifp)->ni_tx_rings + \
(nifp)->ni_host_tx_rings])
// Expands to
((struct netmap_ring *)(void *)((char *)(nifp) +
((nifp)->ring_ofs[index + (nifp)->ni_tx_rings +
(nifp)->ni_host_tx_rings])))
Zig uses @cImport to deal with these things, but at times the translation is not 100%. Still, it is good enough to get us through these rough areas:
pub inline fn _NETMAP_OFFSET(@"type": anytype, ptr: anytype, offset: anytype) @TypeOf(@"type"(?*anyopaque)(@import("std").zig.c_translation.cast([*c]u8, ptr) + offset)) {
_ = &@"type";
_ = &ptr;
_ = &offset;
return @"type"(?*anyopaque)(@import("std").zig.c_translation.cast([*c]u8, ptr) + offset);
}
pub inline fn NETMAP_RXRING(nifp: anytype, index_1: anytype) @TypeOf(_NETMAP_OFFSET([*c]struct_netmap_ring, nifp, nifp.*.ring_ofs[@as(usize, @intCast((index_1 + nifp.*.ni_tx_rings) + nifp.*.ni_host_tx_rings))])) {
_ = &nifp;
_ = &index_1;
return _NETMAP_OFFSET([*c]struct_netmap_ring, nifp, nifp.*.ring_ofs[@as(usize, @intCast((index_1 + nifp.*.ni_tx_rings) + nifp.*.ni_host_tx_rings))]);
}
@"type" isn't actually valid and this function fails unfortunately and this is where we need to step in and fix things up. It gets most of the way there which is much better than what another programming language would provide. Zig actually tries to articulate all the intricacies here which is why I took up coding with it in the first place.
As time goes on things will become less procedural, and hopefully that buildPacket function will too. Right now it makes it easy to drop values in ad-hoc when needed, and it also tells the tale of the weirdness I have experienced up until now. For buildPacket I was hitting memcpy memory corruption when copying over non-contiguous pieces of memory, so I did unfortunately get to the point of auditing every single member of that array. Zig is a young language, so I keep bumping into things like that. The performance, however, speaks for itself. I could write my own tools to do what Zig does and have my own DSL, or I could just use Zig.
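For what it's worth, once the corruption issue is sorted out, a less procedural version could look something like the sketch below. It leans on Zig's comptime array concatenation, and the names are mine rather than part of the API:
// A possible less procedural buildPacket, sketched under the assumption that
// every field stays comptime-known: `++` concatenates the arrays at compile
// time and the returned slice points at static memory.
const packet_template: [60]u8 =
    ([_]u8{0xFF} ** 6) ++ // broadcast destination MAC
    ([_]u8{0x00} ** 6) ++ // zeroed source MAC
    [_]u8{ 0x08, 0x00 } ++ // EtherType (IPv4)
    ([_]u8{0} ** 46); // 46 bytes of zero payload

pub fn buildPacketConcat() []const u8 {
    return &packet_template;
}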
100Gbps seems excessive, doesn't it? That is because, it turns out, I was updating the tx and rx slot lengths simultaneously in some hideous mistake. Whenever you get these astronomical values it is good to check further and see what is going on. Since writing this I had a power outage, and after bringing the machine back up nothing worked; it turns out there were still a few things wrong. Since then I have simplified the API some more to store the addresses of both rings and buffers as pointers, and to use anyopaque to store the buffers. This looks a bit more concise:
for (rng.slots.?.items) |slot| {
const msg: []const u8 = buildPacket();
std.debug.print("buf idx: {}\n", .{slot._slot.?.buf_idx});
const bf: [*c]u8 = @alignCast(@ptrCast(slot._view.buf));
const len: u16 = @truncate(msg.len);
std.crypto.secureZero(
u8,
bf[0..msg.len],
);
@memcpy(bf[0..msg.len], msg);
slot._slot.?.len = len;
std.debug.print("C String: {X}\n", .{bf[0..msg.len]});
}
The Python speeds are as follows for tx.py:
870.232124 main_thread [2781] 17.326 Mpps (17.335 Mpkts 8.317 Gbps in 1000479 usec) 256.34 avg_batch 1 min_space
871.233117 main_thread [2781] 19.112 Mpps (19.131 Mpkts 9.174 Gbps in 1000994 usec) 256.22 avg_batch 1 min_space
872.234124 main_thread [2781] 17.408 Mpps (17.425 Mpkts 8.356 Gbps in 1001007 usec) 256.14 avg_batch 256 min_space
873.235122 main_thread [2781] 18.698 Mpps (18.717 Mpkts 8.975 Gbps in 1000998 usec) 256.34 avg_batch 1 min_space
874.236123 main_thread [2781] 19.919 Mpps (19.939 Mpkts 9.561 Gbps in 1001001 usec) 256.15 avg_batch 1 min_space
875.237122 main_thread [2781] 18.584 Mpps (18.603 Mpkts 8.921 Gbps in 1000999 usec) 256.10 avg_batch 1 min_space
876.238111 main_thread [2781] 19.494 Mpps (19.513 Mpkts 9.357 Gbps in 1000989 usec) 256.06 avg_batch 256 min_space
And now the new and improved Zig speed:
956.379630 main_thread [2781] 24.087 Mpps (24.112 Mpkts 11.562 Gbps in 1001033 usec) 256.05 avg_batch 0 min_space
957.380678 main_thread [2781] 24.205 Mpps (24.230 Mpkts 11.618 Gbps in 1001048 usec) 256.03 avg_batch 512 min_space
958.381724 main_thread [2781] 24.406 Mpps (24.431 Mpkts 11.715 Gbps in 1001047 usec) 256.02 avg_batch 512 min_space
959.382651 main_thread [2781] 24.374 Mpps (24.396 Mpkts 11.699 Gbps in 1000927 usec) 256.03 avg_batch 1 min_space
960.383717 main_thread [2781] 23.921 Mpps (23.946 Mpkts 11.482 Gbps in 1001066 usec) 256.04 avg_batch 256 min_space
961.384121 main_thread [2781] 24.063 Mpps (24.073 Mpkts 11.550 Gbps in 1000403 usec) 256.02 avg_batch 512 min_space
962.385148 main_thread [2781] 24.095 Mpps (24.120 Mpkts 11.566 Gbps in 1001027 usec) 256.01 avg_batch 256 min_space
963.386189 main_thread [2781] 22.208 Mpps (22.231 Mpkts 10.660 Gbps in 1001042 usec) 256.02 avg_batch 256 min_space
964.387235 main_thread [2781] 21.936 Mpps (21.959 Mpkts 10.529 Gbps in 1001046 usec) 256.02 avg_batch 1 min_space
965.388283 main_thread [2781] 23.325 Mpps (23.349 Mpkts 11.196 Gbps in 1001047 usec) 256.02 avg_batch 256 min_space
966.389328 main_thread [2781] 24.206 Mpps (24.232 Mpkts 11.619 Gbps in 1001046 usec) 256.02 avg_batch 1 min_space
As you can see they are nearly the same, because they are, for once, actually doing the same thing! Imagine that. After getting all excited I realized how seriously bad it was that I had been sending these packets with a length of 60 and somehow saturating the connection. Basically, that's just not possible if everything is exactly the same. I finally do get packets on the other end of the vale switch using onepacket.py as well:
Waiting for a packet to come
Received a packet with len 60
ffffffffffff000000000000080000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Waiting for a packet to come
Received a packet with len 60
ffffffffffff000000000000080000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Waiting for a packet to come
Received a packet with len 60
ffffffffffff000000000000080000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Waiting for a packet to come
Received a packet with len 60
ffffffffffff000000000000080000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
The number one thing to remember is that netmap does not make your network go faster! Despite this, many applications using it are built around doing things fast. Go figure.
As mentioned earlier, I have been diligently implementing the netmap API using a new language named "Zig". For the purpose of keeping only what we want within our code it has been an interesting choice. The ability to reach down into the C headers and re-implement the things we need has become necessary, because it allows for more granular control over what happens next. So, we start from the beginning again to look at how packet transmissions occurred with the Python API. The code that we have so far is a much more thorough expression of this: