This week, I implemented a small program that processes a data set too large to fit in main memory.
I first tried stxxl. stxxl is an excellent library that mimics the STL and processes data in external memory. But it has limitations; for example, the element type must be a plain old data (POD) type. Since my type doesn't provide a default constructor, stxxl couldn't meet my needs (please refer to this discussion). I also tried other workarounds, but they all failed.
Finally, I used a simple method: open a file, serialize the data set into it, and treat the file as if it were main memory. Although it is not the most efficient approach, the program is very clear and not prone to bugs. So I decided to use it as a demo and improve it gradually.
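To make the file-as-memory idea concrete, here is a minimal C++ sketch, not the actual program. It assumes every record serializes to a fixed number of bytes, so record i sits at byte offset i * kRecordSize. The names Record, FileArray and kRecordSize are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical record type with no default constructor,
// the property that ruled out stxxl in my case.
struct Record {
    std::int64_t id;
    double value;
    Record(std::int64_t id_, double value_) : id(id_), value(value_) {}
};

// Assumption: each record is stored as its two fields back to back.
constexpr std::size_t kRecordSize = sizeof(std::int64_t) + sizeof(double);

// A file that is indexed like an array of Record.
class FileArray {
public:
    explicit FileArray(const std::string& path)
        : file_(path, std::ios::in | std::ios::out |
                      std::ios::binary | std::ios::trunc),
          size_(0) {
        if (!file_) throw std::runtime_error("cannot open " + path);
    }

    // Append a record at the end of the file.
    void push_back(const Record& r) {
        file_.seekp(static_cast<std::streamoff>(size_ * kRecordSize));
        write_record(r);
        ++size_;
    }

    // Overwrite the record at index i.
    void set(std::size_t i, const Record& r) {
        assert(i < size_);
        file_.seekp(static_cast<std::streamoff>(i * kRecordSize));
        write_record(r);
    }

    // Read the record at index i back from disk.
    Record get(std::size_t i) {
        assert(i < size_);
        file_.seekg(static_cast<std::streamoff>(i * kRecordSize));
        std::int64_t id = 0;
        double value = 0.0;
        file_.read(reinterpret_cast<char*>(&id), sizeof(id));
        file_.read(reinterpret_cast<char*>(&value), sizeof(value));
        if (!file_) throw std::runtime_error("read failed");
        return Record(id, value);
    }

    std::size_t size() const { return size_; }

private:
    void write_record(const Record& r) {
        file_.write(reinterpret_cast<const char*>(&r.id), sizeof(r.id));
        file_.write(reinterpret_cast<const char*>(&r.value), sizeof(r.value));
        if (!file_) throw std::runtime_error("write failed");
    }

    std::fstream file_;
    std::size_t size_;
};
```

Only one record is in memory at a time, and get() rebuilds it through the ordinary constructor, so the type never needs a default constructor. Every access costs a seek plus a small read or write, which is where the inefficiency comes from. Caching blocks of records would be an obvious first improvement.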
Update: Splitting the large file into smaller ones and using multiple threads to process them is a good idea.
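A rough C++ sketch of that split-and-thread idea, under simplifying assumptions: the big file is a flat sequence of raw double values, and the per-chunk work is just a sum. The helpers split_file, sum_chunk and parallel_sum are hypothetical names for this illustration. One thread per chunk keeps the example short; a real program would probably cap the thread count.

```cpp
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <string>
#include <thread>
#include <vector>

// Stream the big file once and copy it into chunk files holding at most
// per_chunk values each. Returns the chunk file names.
std::vector<std::string> split_file(const std::string& path,
                                    std::size_t per_chunk) {
    if (per_chunk == 0) throw std::invalid_argument("per_chunk must be > 0");
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);

    std::vector<std::string> chunks;
    std::ofstream out;
    std::size_t written = 0;
    double v = 0.0;
    while (in.read(reinterpret_cast<char*>(&v), sizeof(v))) {
        if (written % per_chunk == 0) {  // time to start a new chunk file
            if (out.is_open()) out.close();
            chunks.push_back(path + ".part" + std::to_string(chunks.size()));
            out.open(chunks.back(), std::ios::binary | std::ios::trunc);
            if (!out) throw std::runtime_error("cannot create " + chunks.back());
        }
        out.write(reinterpret_cast<const char*>(&v), sizeof(v));
        ++written;
    }
    return chunks;
}

// The per-chunk work: here, simply add up the values in one chunk file.
double sum_chunk(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    double v = 0.0;
    double sum = 0.0;
    while (in.read(reinterpret_cast<char*>(&v), sizeof(v))) sum += v;
    return sum;
}

// Process every chunk on its own thread, then combine the partial results.
double parallel_sum(const std::vector<std::string>& chunks) {
    std::vector<double> partial(chunks.size(), 0.0);
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < chunks.size(); ++i) {
        // Each thread writes only to its own slot, so no locking is needed.
        workers.emplace_back([&partial, &chunks, i] {
            partial[i] = sum_chunk(chunks[i]);
        });
    }
    for (std::thread& w : workers) w.join();

    double total = 0.0;
    for (double p : partial) total += p;
    return total;
}
```

Because each thread owns its chunk file and its own result slot, the workers share no mutable state. Depending on the toolchain, the program may need to be built with -pthread.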