Process large data in external memory

This week, I implemented a small program which handles a data set. The volume of data set is so big that it can’t be stored in main memory.

I first tried to use stxxlstxxl is an awesome library which mimics STL and processes data in external memory. But it has many limitations, such as data type should be plain old data type. Since my type doesn’t provide default constructor, stxxl can’t satisfy my need (please refer this discussion). I also make attempts on other workaruonds, but all failed.

Finally, I used a simple method: Open a file, serialize the data set into file, and treat the file like the main memory. Although it is not the most efficient approach, the program is vey clear, and not prone to bugs. So I decide to use it as a demo, and improve it gradually.

Update: Split large file into smaller ones, and use multiple threads to handle them is a good idea.

 

First taste of stxxl

stxxl is an interesting project:

STXXL is an implementation of the C++ standard template library STL for external memory (out-of-core) computations, i. e. STXXL implements containers and algorithms that can process huge volumes of data that only fit on disks. While the closeness to the STL supports ease of use and compatibility with existing applications, another design priority is high performance.

Recently, I have a task to do operation on data which can’t be stored in memory, so I want to give a shot of stxxl.

My OS is OpenBSD and the memory is less than 4G and I try to modify the vector1.cpp:

for (int i = 0; i < 1024 * 1024; i++)

to

for (int i = 0; i < 1024 * 1024 * 1024; i++)

Since every integer occupies 4 bytes, so it will require at least 4G storage to save the vector. Build and run this program:

# ./vector1
[STXXL-MSG] STXXL v1.4.99 (prerelease/Debug) (git 263df0c54dc168212d1c7620e3c10c93791c9c29)
[STXXL-ERRMSG] Warning: no config file found.
[STXXL-ERRMSG] Using default disk configuration.
[STXXL-MSG] Warning: open()ing /var/tmp/stxxl without DIRECT mode, as the system does not support it.
[STXXL-MSG] Disk '/var/tmp/stxxl' is allocated, space: 1000 MiB, I/O implementation: syscall delete_on_exit queue=0 devid=0
[STXXL-ERRMSG] External memory block allocation error: 2097152 bytes requested, 0 bytes free. Trying to extend the external memory space...
[STXXL-ERRMSG] External memory block allocation error: 2097152 bytes requested, 0 bytes free. Trying to extend the external memory space...
[STXXL-ERRMSG] External memory block allocation error: 2097152 bytes requested, 0 bytes free. Trying to extend the external memory space...
[STXXL-ERRMSG] External memory block allocation error: 2097152 bytes requested, 0 bytes free. Trying to extend the external memory space...
......
[STXXL-ERRMSG] External memory block allocation error: 2097152 bytes requested, 0 bytes free. Trying to extend the external memory space...
101
[STXXL-ERRMSG] Removing disk file: /var/tmp/stxxl

The program outputs 101 which is correct result. Check /var/tmp/stxxl before it is deleted:

# ls -alt /var/tmp/stxxl
-rw-r-----  1 root  wheel  4294967296 Mar  9 15:58 /var/tmp/stxxl

It is indeed the data file and size is 4G.

Based on this simple test, stxxl gives me a good impression, and is worthy for further exploring.

P.S., use cmake -DBUILD_TESTS=ON .. to enable building the examples.

Clang may be a better option than gcc when requiring much memory

I used an old machine (the OS is Arch Linux, and memory less than 3G) to build stxxl project:

# cmake -DBUILD_TESTS=ON ..
-- The C compiler identification is GNU 7.3.0
-- The CXX compiler identification is GNU 7.3.0
......
# make VERBOSE=1
......
cd /root/stxxl/build/examples/applications && /usr/bin/c++  -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -D_LARGE_FILES -I/root/stxxl/include -I/root/stxxl/build/include  -W -Wall -pedantic -Wno-long-long -Wextra -ftemplate-depth=1024 -std=c++11 -fopenmp -g   -o CMakeFiles/skew3-lcp.dir/skew3-lcp.cpp.o -c /root/stxxl/examples/applications/skew3-lcp.cpp

The default compiler is gcc 7.3.0, and the building process was stuck at compiling skew3-lcp.cpp. The output of htop showed that nearly all memory is occupied:

1

Switch to clang:

# cmake -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DBUILD_TESTS=ON ..
-- The C compiler identification is Clang 5.0.1
-- The CXX compiler identification is Clang 5.0.1
......
# make
......
[100%] Linking CXX executable test1
[100%] Built target test1

The project can be built successfully, and the peak memory used for compiling skew3-lcp.cpp is 1.78G. Based on this test, if you have compiling task which needs much memory, clang may be a better choice than gcc.