How to process a large file?

In Process large data in external memory, I mentioned:

Update: Splitting a large file into smaller ones and using multiple threads to handle them is a good idea.

I want to elaborate here on how to process a large file:

(1) Split the large file into smaller ones that are independent of each other, e.g., based on users. Then you can spawn multiple threads, one to process each small file.

(2) For the output: if all threads write to the same file, the write operations must be atomic, and the shared file becomes a bottleneck for the whole program. So every thread should have its own output file.

(3) After all threads exit, the main thread can use cat or other methods to consolidate all the output files into one, as the sketch below shows.
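
To make this concrete, here is a minimal sketch using POSIX threads. The file names part-N.txt and out-N.txt are hypothetical, and it assumes the large file has already been split into NWORKERS parts:

#include <pthread.h>
#include <stdio.h>

#define NWORKERS 4

static void *worker(void *arg)
{
    long id = (long)arg;
    char in_name[64], out_name[64];

    snprintf(in_name, sizeof in_name, "part-%ld.txt", id);
    snprintf(out_name, sizeof out_name, "out-%ld.txt", id);

    FILE *in = fopen(in_name, "r");
    FILE *out = fopen(out_name, "w");
    if (in == NULL || out == NULL)
        return NULL;

    char line[4096];
    while (fgets(line, sizeof line, in) != NULL) {
        /* Real processing goes here; this sketch just copies the line. */
        fputs(line, out);
    }

    fclose(in);
    fclose(out);
    return NULL;
}

int main(void)
{
    pthread_t tid[NWORKERS];

    for (long i = 0; i < NWORKERS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(tid[i], NULL);

    /* All workers are done; consolidate the per-thread outputs, e.g.:
       cat out-*.txt > result.txt */
    return 0;
}

Compile with gcc -pthread. Because each worker touches only its own input and output files, no locking is needed, which is exactly why step (2) recommends per-thread output files.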

Beware of using the GNU libc basename() function

From the manual page, we know there are two implementations of basename(). One is POSIX-compliant:

#include <libgen.h>

char *basename(char *path);

The other is the GNU version:

#define _GNU_SOURCE         /* See feature_test_macros(7) */
#include <string.h>

But the manual doesn't mention that the prototype of the GNU version differs from the POSIX-compliant one (the parameter type is const char *, not char *):

char *basename (const char *__filename)

The implementation is also simple; it just invokes strrchr():

char *
__basename (const char *filename)
{
  char *p = strrchr (filename, '/');
  return p ? p + 1 : (char *) filename;
}
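
A small demonstration of the difference, assuming glibc: since the GNU version takes const char * and, as the implementation above shows, never writes to its argument, it is safe to call on a string literal:

#define _GNU_SOURCE   /* select the GNU version */
#include <stdio.h>
#include <string.h>   /* GNU basename(); do not include <libgen.h> */

int main(void)
{
    /* Safe: the GNU version never modifies its argument. */
    printf("%s\n", basename("/usr/bin/gcc")); /* "gcc" */
    printf("%s\n", basename("gcc"));          /* "gcc": no '/' found */
    return 0;
}

With <libgen.h> included instead, the POSIX basename() is allowed to modify the path passed to it, so handing it a string literal would be undefined behavior.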

Rewrite a Python program in C to boost performance

Recently I converted a Python program to C. The Python program needs about 1 hour to finish its task:

$ /usr/bin/time -v taskset -c 35 python_program ....
......
        User time (seconds): 3553.48
        System time (seconds): 97.70
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 1:00:51
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 12048772
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 11434463
        Voluntary context switches: 58873
        Involuntary context switches: 21529
        Swaps: 0
        File system inputs: 1918744
        File system outputs: 4704
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

while the C program only needs about 5 minutes:

$ /usr/bin/time -v taskset -c 35 c_program ....
......
        User time (seconds): 282.45
        System time (seconds): 8.66
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 4:51.17
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 16430216
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 3962437
        Voluntary context switches: 14
        Involuntary context switches: 388
        Swaps: 0
        File system inputs: 1918744
        File system outputs: 4960
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

From the output of /usr/bin/time, we can see the Python program uses less memory than the C program, but suffers more page faults and context switches.

Valgrind can't work together with sanitizers

Valgrind and sanitizers can't work together. Check the following program, which has an explicit memory leak:

# cat memory-leak.c
#include <stdlib.h>
void *p;
int main() {
  p = malloc(7);
  p = 0; // The memory is leaked here.
  return 0;
}

Build it and run with valgrind:

# gcc memory-leak.c -o memory-leak
# valgrind ./memory-leak
==1155== Memcheck, a memory error detector
==1155== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==1155== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
==1155== Command: ./memory-leak
==1155==
==1155==
==1155== HEAP SUMMARY:
==1155==     in use at exit: 7 bytes in 1 blocks
==1155==   total heap usage: 1 allocs, 0 frees, 7 bytes allocated
==1155==
==1155== LEAK SUMMARY:
==1155==    definitely lost: 7 bytes in 1 blocks
==1155==    indirectly lost: 0 bytes in 0 blocks
==1155==      possibly lost: 0 bytes in 0 blocks
==1155==    still reachable: 0 bytes in 0 blocks
==1155==         suppressed: 0 bytes in 0 blocks
==1155== Rerun with --leak-check=full to see details of leaked memory
==1155==
==1155== For lists of detected and suppressed errors, rerun with: -s
==1155== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

The memory leak is detected successfully. Now build the program with the “-fsanitize=address” option and run valgrind again:

# gcc -fsanitize=address memory-leak.c -o memory-leak
# valgrind ./memory-leak
==1193== Memcheck, a memory error detector
==1193== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==1193== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
==1193== Command: ./memory-leak
==1193==
==1193==ASan runtime does not come first in initial library list; you should either link runtime to your application or manually preload it with LD_PRELOAD.
==1193==
==1193== HEAP SUMMARY:
==1193==     in use at exit: 0 bytes in 0 blocks
==1193==   total heap usage: 0 allocs, 0 frees, 0 bytes allocated
==1193==
==1193== All heap blocks were freed -- no leaks are possible
==1193==
==1193== For lists of detected and suppressed errors, rerun with: -s
==1193== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

We can see that valgrind can't work normally once the program is built with ASan.
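
Note that ASan comes with its own leak detector, LeakSanitizer, which is enabled by default on most Linux targets, so a sanitizer-built binary does not need valgrind for leak checking. A minimal sketch (the leak report itself is omitted here):

# gcc -fsanitize=address memory-leak.c -o memory-leak
# ./memory-leak
......

At process exit, LeakSanitizer reports the 7 leaked bytes.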

Reference:
Sourceforge.

Building OmniOS requires some memory

I followed Building OmniOS to build OmniOS, but met the following errors:

$ pfexec /opt/ooce/bin/omni build_world
......
./buildctl: fork: Not enough space
../lib/functions.sh: line 627: wait_for: No record of process 22984
../lib/functions.sh: fork: Not enough space
Waiting for illumos build...
......

The reason is that the virtual machine I used has only ~1.5 GB of memory, which is not enough. After increasing the memory to 4 GB, the build succeeds.