In Process large data in external memory, I mentioned:
Update: Splitting a large file into smaller ones and using multiple threads to handle them is a good idea.
I want to elaborate here on how to process a large file:
(1) Split the large file into small ones that are independent of each other, e.g., one per user. Then you can spawn multiple threads, one to process each small file.
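For illustration, here is a minimal sketch of such a split in Go, assuming each input line starts with a user ID followed by a tab; the file names (big.txt, chunk_<user>.txt) are made up for this example:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
)

// Split big.txt into one file per user, assuming (hypothetically) that
// every line has the form "userID<TAB>payload".
func main() {
	in, err := os.Open("big.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer in.Close()

	writers := make(map[string]*bufio.Writer) // user ID -> buffered writer
	var files []*os.File

	scanner := bufio.NewScanner(in)
	for scanner.Scan() {
		line := scanner.Text()
		user, _, ok := strings.Cut(line, "\t")
		if !ok {
			continue // skip malformed lines
		}
		w, exists := writers[user]
		if !exists {
			f, err := os.Create(fmt.Sprintf("chunk_%s.txt", user))
			if err != nil {
				log.Fatal(err)
			}
			files = append(files, f)
			w = bufio.NewWriter(f)
			writers[user] = w
		}
		fmt.Fprintln(w, line)
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}

	// Flush and close every per-user chunk file.
	for _, w := range writers {
		w.Flush()
	}
	for _, f := range files {
		f.Close()
	}
}
```

One caveat: this keeps one open file per distinct user, so with a huge number of users a real splitter would need to cap or recycle open file handles.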
(2) For the output: if all threads write to the same file, the writes must be synchronized, and that single file becomes the bottleneck of the program. So every thread should write to its own output file.
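Continuing the sketch in Go (with goroutines standing in for threads), each worker below reads one chunk file and writes results to its own output file, so no lock is needed on the output path; processLine and the chunk/output file names are hypothetical placeholders:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
	"sync"
)

// processLine is a placeholder for the real per-record work.
func processLine(line string) string {
	return strings.ToUpper(line)
}

// worker reads one chunk file and writes its results to a private output
// file, so no synchronization is needed between workers.
func worker(chunk, out string, wg *sync.WaitGroup) {
	defer wg.Done()

	in, err := os.Open(chunk)
	if err != nil {
		log.Printf("open %s: %v", chunk, err)
		return
	}
	defer in.Close()

	f, err := os.Create(out)
	if err != nil {
		log.Printf("create %s: %v", out, err)
		return
	}
	defer f.Close()

	w := bufio.NewWriter(f)
	defer w.Flush()

	scanner := bufio.NewScanner(in)
	for scanner.Scan() {
		w.WriteString(processLine(scanner.Text()) + "\n")
	}
	if err := scanner.Err(); err != nil {
		log.Printf("read %s: %v", chunk, err)
	}
}

func main() {
	chunks := []string{"chunk_alice.txt", "chunk_bob.txt"} // hypothetical names
	var wg sync.WaitGroup
	for i, chunk := range chunks {
		wg.Add(1)
		go worker(chunk, fmt.Sprintf("out_%d.txt", i), &wg)
	}
	wg.Wait() // all workers have exited; the outputs are ready to merge
}
```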
(3) After all threads exit, the main thread can use cat or other methods to consolidate all the output files into one.
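The consolidation can also be done inside the program instead of shelling out to cat; the sketch below simply appends each per-thread output file to final.txt (file names again hypothetical):

```go
package main

import (
	"io"
	"log"
	"os"
)

// Concatenate the per-thread output files into final.txt, a programmatic
// alternative to running `cat out_*.txt > final.txt`.
func main() {
	final, err := os.Create("final.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer final.Close()

	for _, name := range []string{"out_0.txt", "out_1.txt"} {
		f, err := os.Open(name)
		if err != nil {
			log.Fatal(err)
		}
		if _, err := io.Copy(final, f); err != nil {
			log.Fatal(err)
		}
		f.Close()
	}
}
```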