Process large pcap file

To process a large pcap file, it is usually better to split it into small chunks first, then process the chunks in parallel. I implemented a simple shell script to do it:

#!/bin/sh

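input_pcap=input.pcap
output_pcap=./pcap/frag.pcap   # the ./pcap directory must already exist
split_size=1000                # tcpdump -C counts in millions of bytes, so about 1GB
output_index=1
loop_count=10                  # number of chunks processed in parallel per batch
exit_flag=0

# Placeholder for the real work: $1 is the base file name, $2 the chunk suffix.
command() {
    echo "$1" "$2" > log"$2"
}

# Split the input: tcpdump names the chunks frag.pcap, frag.pcap1, frag.pcap2, ...
tcpdump -r ${input_pcap} -w ${output_pcap} -C ${split_size}

# The first chunk has no numeric suffix, so handle it separately.
command ${output_pcap}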

while :
do
    loop_index=0
    while test ${loop_index} -lt ${loop_count}
    do
        if test -e ${output_pcap}${output_index}
        then
            command ${output_pcap} ${output_index} &
            output_index=$((output_index + 1))
            loop_index=$((loop_index + 1))
        else
            exit_flag=1
            break
        fi
    done
    wait

    if test ${exit_flag} -eq 1
    then
        exit 0
    fi
done

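The script first splits the input pcap file into chunks of about 1GB each. It then launches up to 10 processes at a time to crunch the data (in the example above, each one just writes a line to a log file), waits for the batch to finish, and repeats until no chunks are left. You can of course customize it.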

BTW, the code can be downloaded here.

The IO stream’s state when EOF occurs

Consider the following simple C++ program:

#include <iostream>
using namespace std;

int
main()
{
    char ch;

    while (cin >> ch)
    {
        cout << ch << '\n';
    }

    cout << "bad: " << cin.bad() << ", eof: " << cin.eof() << ", fail: " << cin.fail() << '\n';

    return 0;
}

Compile and run it, then press Ctrl + D:

$ c++ foo.cpp -o foo
$ ./foo
bad: 0, eof: 1, fail: 1

We can see that both the fail and eof bits are set to 1. From this table, we can see that when both bits are set, the stream’s operator bool returns false (it returns !fail(), so the fail bit alone is enough).
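The same behaviour can be reproduced without a terminal by reading from an in-memory stream. The following is a minimal sketch, not part of the original program: StreamState, drain and drain_string are hypothetical names introduced for illustration.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Snapshot of the stream flags once reading has stopped.
struct StreamState
{
    bool bad;
    bool eof;
    bool fail;
    bool ok; // result of static_cast<bool>(stream), i.e. !fail()
};

// Read characters with operator>> until extraction fails, then report the flags.
StreamState drain(std::istream &in)
{
    char ch;
    while (in >> ch)
    {
        // A real program would process ch here.
    }
    return StreamState{in.bad(), in.eof(), in.fail(), static_cast<bool>(in)};
}

// Convenience wrapper: run drain() over a string instead of cin.
StreamState drain_string(const std::string &text)
{
    std::istringstream in(text);
    return drain(in);
}
```

Calling drain_string("ab") should report the same flags as the Ctrl + D run above: the extraction that hits the end of input sets both the eof and fail bits, and the stream converts to false.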

pcap_next_ex() blocks in Void Linux forever

My Void Linux runs in a virtual machine. I implemented a simple packet-capturing program using libpcap, but found that it blocks in pcap_next_ex():

(gdb) bt
#0  0x00007ffff7ad7763 in __GI___poll (fds=fds@entry=0x7fffffffe160, nfds=nfds@entry=1, timeout=-1)
    at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007ffff7f8309d in poll (__timeout=<optimized out>, __nfds=1, __fds=0x7fffffffe160) at /usr/include/bits/poll2.h:46
#2  pcap_wait_for_frames_mmap (handle=handle@entry=0x55555556eba0) at ./pcap-linux.c:5018
#3  0x00007ffff7f886f4 in pcap_read_linux_mmap_v3 (handle=0x55555556eba0, max_packets=1, callback=0x7ffff7f82bf0 <pcap_oneshot_mmap>,
    user=0x7fffffffe1f0 "\260\355VUUU") at ./pcap-linux.c:5577
#4  0x00007ffff7f8c6a2 in pcap_next_ex (p=<optimized out>, pkt_header=<optimized out>, pkt_data=<optimized out>) at ./pcap.c:505
......

The libpcap version in Void Linux is 1.9.1. After checking the code, I found that although I set a timeout in pcap_open_live(), the value does not take effect because of this logic in set_poll_timeout():

    ......
    if (handlep->tp_version == TPACKET_V3 && !broken_tpacket_v3)
        handlep->poll_timeout = -1; /* block forever, let TPACKET_V3 wake us up */
    ......

I don’t have a physical machine, so I am not sure whether this issue only happens in virtual machines.
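The backtrace shows poll() being called with timeout=-1. To illustrate what that value means, here is a small sketch that is independent of libpcap: it polls the read end of an empty pipe, so nothing ever becomes readable. poll_idle_pipe is a hypothetical helper written for this note, not a libpcap function.

```cpp
#include <poll.h>
#include <unistd.h>

// Poll the read end of an empty pipe, so no data will ever arrive.
// Returns poll()'s result: 0 means the timeout expired, -1 means an error.
// Passing timeout_ms = -1 would block forever, as in the backtrace above.
int poll_idle_pipe(int timeout_ms)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    struct pollfd pfd;
    pfd.fd = fds[0];
    pfd.events = POLLIN;
    pfd.revents = 0;

    int ready = poll(&pfd, 1, timeout_ms);

    close(fds[0]);
    close(fds[1]);
    return ready;
}
```

With a finite timeout such as poll_idle_pipe(50), the call returns after about 50 milliseconds. With -1 it never returns unless data arrives, which matches the behaviour seen when no packets reach the capture handle.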

Enhance Hyperscan’s pcapCorpus.py script

Hyperscan’s pcapCorpus.py script converts a pcap file containing UDP and TCP packets into a corpus file that can be processed by the hsbench program. I made some improvements to this script:

a) Support parsing VLAN packets;
b) Keep processing instead of exiting when an exceptional packet is encountered.

P.S., the code can be downloaded here.

Cacheline-Orientated programming

From the CPU’s perspective, the memory hierarchy consists of registers, L1 cache, L2 cache, L3 cache, main memory, and so on. The smallest unit of cache is a cacheline, which is 64 bytes in most cases:

$ getconf LEVEL1_DCACHE_LINESIZE
64

To make your applications run efficiently, you need to take the cacheline into account. Take the notorious cacheline false sharing as an example:

    ......
    struct Foo
    {
        int a;
        int b;
        int c[14];
    };
    ......

The size of struct Foo is 64 bytes (assuming a 4-byte int), so it fits in one cacheline. If CPU 0 writes Foo.a while CPU 1 writes Foo.b at the same time, the cacheline will “ping-pong” between the CPUs, and performance will degrade drastically.
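Whether two fields can share a cacheline can be checked from their offsets. The sketch below assumes a 64-byte line, a 4-byte int, and an object that starts on a line boundary; kCacheLine, line_of and SplitFoo are hypothetical names used only for illustration.

```cpp
#include <cstddef>

constexpr std::size_t kCacheLine = 64; // assumed cacheline size in bytes

struct Foo
{
    int a;
    int b;
    int c[14];
};

// Hypothetical variant that forces a and b onto separate cachelines.
struct SplitFoo
{
    alignas(64) int a;
    alignas(64) int b;
};

// Index of the cacheline a byte offset falls into, counted from the
// start of an object that itself begins on a line boundary.
constexpr std::size_t line_of(std::size_t offset)
{
    return offset / kCacheLine;
}

// True for Foo: a (offset 0) and b (offset 4) are in line 0.
constexpr bool a_and_b_share_line()
{
    return line_of(offsetof(Foo, a)) == line_of(offsetof(Foo, b));
}

// False for SplitFoo: b is pushed to offset 64, i.e. line 1.
constexpr bool split_a_and_b_share_line()
{
    return line_of(offsetof(SplitFoo, a)) == line_of(offsetof(SplitFoo, b));
}
```

The price of SplitFoo is size: it occupies two full cachelines for two ints, so this kind of separation is only worth it for fields that different CPUs write frequently.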

Another trick is to allocate memory aligned to the cacheline size. Take the above struct Foo as the example again. To guarantee that the whole struct Foo sits in a single cacheline, allocate it on a cacheline boundary with posix_memalign:

    struct Foo *foo;
    /* posix_memalign takes a void ** and returns 0 on success */
    posix_memalign((void **)&foo, 64, sizeof(struct Foo));

Here 64 is the required alignment in bytes.
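A slightly fuller sketch of the allocation, with the return value checked and the alignment verified, might look like this. alloc_line_aligned and is_line_aligned are hypothetical helper names; the 64-byte line size is an assumption.

```cpp
#include <stdlib.h> // posix_memalign and free on POSIX systems
#include <cstddef>
#include <cstdint>

// Allocate size bytes starting on a 64-byte boundary.
// Returns nullptr on failure; release the memory with free().
void *alloc_line_aligned(std::size_t size)
{
    void *mem = nullptr;
    if (posix_memalign(&mem, 64, size) != 0)
        return nullptr;
    return mem;
}

// True if the address is a multiple of 64.
bool is_line_aligned(const void *p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 64 == 0;
}
```

Because posix_memalign reports errors through its return value rather than errno, checking for a non-zero result is the reliable way to detect failure.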

Last but not least, sometimes padding is needed. E.g.:

    ......
    struct Foo
    {
        int a;
        int b;
        int c[12];
        int padding[2];
    };
    ......
    struct Foo *foo;
    posix_memalign((void **)&foo, 64, sizeof(struct Foo) * 10);

Or use the compiler’s aligned attribute:

    ......
    struct Foo
    {
        int a;
        int b;
        int c[12];
    } __attribute__((aligned(64)));
    ......

The original struct Foo is 56 bytes. After padding (or with the compiler’s aligned attribute) it becomes 64 bytes and fills exactly one cacheline. Now we can allocate a cacheline-aligned array of struct Foo and let every CPU process its own element; no “cacheline ping-ponging” will occur.
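In C++ the same effect can be expressed with the standard alignas specifier, and the size and placement claims can be checked directly. This is a sketch under the assumptions of a 64-byte cacheline and a 4-byte int; AlignedFoo and each_on_own_line are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

// 56 bytes of data, rounded up to 64 by the alignment requirement.
struct alignas(64) AlignedFoo
{
    int a;
    int b;
    int c[12];
};

static_assert(sizeof(AlignedFoo) == 64, "one element should fill one cacheline");
static_assert(alignof(AlignedFoo) == 64, "elements should start on a line boundary");

// True if every element of the array starts on a 64-byte boundary,
// which means no two elements share a cacheline.
bool each_on_own_line(const AlignedFoo *arr, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
    {
        if (reinterpret_cast<std::uintptr_t>(&arr[i]) % 64 != 0)
            return false;
    }
    return true;
}
```

With this layout, an array such as AlignedFoo arr[4] gives each worker a private cacheline, provided worker i only touches arr[i].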