Process large data in external memory

This week, I implemented a small program that processes a data set too large to fit in main memory.

I first tried stxxl, an awesome library that mimics the STL and processes data in external memory. But it has many limitations; for example, element types must be plain old data (POD) types. Since my type doesn't provide a default constructor, stxxl couldn't meet my needs (please refer to this discussion). I also tried other workarounds, but they all failed.

Finally, I used a simple method: open a file, serialize the data set into it, and treat the file like main memory. Although it is not the most efficient approach, the program is very clear and not prone to bugs. So I decided to use it as a demo and improve it gradually.
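To make the idea concrete, here is a minimal sketch of treating a file like an array. It is not my actual program: Record, FileStore, and the two-field layout are hypothetical names chosen for illustration, and it assumes every record serializes to the same number of bytes, so that element i sits at byte offset i * kRecordSize. Note that Record has no default constructor; it is rebuilt from the bytes read back.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical record type without a default constructor.
struct Record {
    std::int64_t id;
    double value;
    Record(std::int64_t id_, double value_) : id(id_), value(value_) {}
};

// Hypothetical file-backed array: element i lives at byte offset i * kRecordSize.
class FileStore {
public:
    static constexpr std::size_t kRecordSize = sizeof(std::int64_t) + sizeof(double);

    // Creates (or truncates) the backing file and opens it for reading and writing.
    explicit FileStore(const std::string& path)
        : file_(path, std::ios::in | std::ios::out | std::ios::binary | std::ios::trunc) {
        if (!file_) throw std::runtime_error("cannot open " + path);
    }

    // Serialize one record field by field at the slot for `index`.
    void write(std::size_t index, const Record& r) {
        assert(file_.is_open());
        file_.clear();
        file_.seekp(static_cast<std::streamoff>(index * kRecordSize));
        file_.write(reinterpret_cast<const char*>(&r.id), sizeof r.id);
        file_.write(reinterpret_cast<const char*>(&r.value), sizeof r.value);
        if (!file_) throw std::runtime_error("write failed");
    }

    // Read the fields back and construct the record from them.
    Record read(std::size_t index) {
        assert(file_.is_open());
        std::int64_t id = 0;
        double value = 0.0;
        file_.clear();
        file_.seekg(static_cast<std::streamoff>(index * kRecordSize));
        file_.read(reinterpret_cast<char*>(&id), sizeof id);
        file_.read(reinterpret_cast<char*>(&value), sizeof value);
        if (!file_) throw std::runtime_error("read failed");
        return Record(id, value);
    }

private:
    std::fstream file_;
};
```

Only one record is held in memory at a time, so the size of the data set is limited by disk space rather than RAM. Every access costs a seek plus a small read or write, which is why this approach is clear but not the most efficient.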

Update: Splitting the large file into smaller ones and using multiple threads to process them is a good idea.
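A rough sketch of that idea, under some assumptions: the file holds raw 64-bit integers, the per-chunk work is a plain sum, and the chunk size is a multiple of 8 bytes so no integer is cut in half. The names split_file, sum_chunk, and parallel_sum are hypothetical, and the sketch starts one thread per chunk for simplicity; a real program would probably cap the number of threads.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <thread>
#include <vector>

// Copy `path` into chunk files of at most `chunk_bytes` bytes each and
// return the chunk file names (path.part0, path.part1, ...).
std::vector<std::string> split_file(const std::string& path, std::size_t chunk_bytes) {
    assert(chunk_bytes > 0);
    std::ifstream in(path, std::ios::binary);
    std::vector<std::string> chunks;
    std::vector<char> buffer(chunk_bytes);
    while (in) {
        in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        std::streamsize got = in.gcount();
        if (got <= 0) break;  // nothing left to copy
        std::string name = path + ".part" + std::to_string(chunks.size());
        std::ofstream out(name, std::ios::binary | std::ios::trunc);
        out.write(buffer.data(), got);
        chunks.push_back(name);
    }
    return chunks;
}

// Hypothetical per-chunk work: sum the 64-bit integers stored in one chunk file.
std::int64_t sum_chunk(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::int64_t total = 0;
    std::int64_t x = 0;
    while (in.read(reinterpret_cast<char*>(&x), sizeof x)) total += x;
    return total;
}

// Process every chunk in its own thread, then combine the partial results.
std::int64_t parallel_sum(const std::vector<std::string>& chunks) {
    std::vector<std::int64_t> partial(chunks.size(), 0);
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < chunks.size(); ++i) {
        // Each thread writes only to its own slot, so no locking is needed.
        workers.emplace_back([&partial, &chunks, i] { partial[i] = sum_chunk(chunks[i]); });
    }
    for (auto& w : workers) w.join();
    std::int64_t total = 0;
    for (std::int64_t p : partial) total += p;
    return total;
}
```

Because each thread opens its own chunk file, the threads share no stream state. Whether this is actually faster depends on the disk: if the work is dominated by I/O on a single drive, extra threads may help little.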

 
