How to process a large file?

In “Process large data in external memory”, I mentioned:

Update: Splitting a large file into smaller ones and using multiple threads to handle them is a good idea.

I want to elaborate on how to process a large file here:

(1) Split the large file into small ones which are independent of each other, e.g., based on users. Then you can spawn multiple threads, each processing one small file.

(2) For the output: if all threads write to the same file, the write operations must be atomic, and that will become the bottleneck of the program. So every thread should have its own output file.

(3) After all threads exit, the main thread can use cat or another method to consolidate all output files into one.
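The three steps above can be sketched as follows. This is only a minimal illustration under assumptions of my own: the input is a text file of "user,value" lines, the per-file work is summing the values per user, and the function names (split_by_key, process_part, merge, run) are hypothetical. Note that in CPython, threads will not speed up CPU-bound work; the sketch is meant to show the structure, not the performance.

```python
import os
import shutil
import tempfile
import threading
import zlib


def split_by_key(src_path, out_dir, num_parts):
    """Step 1: split the input so that all lines of one user land in the same part."""
    paths = [os.path.join(out_dir, f"part_{i}.txt") for i in range(num_parts)]
    parts = [open(p, "w") for p in paths]
    try:
        with open(src_path) as src:
            for line in src:
                if not line.strip():
                    continue  # skip blank lines
                if not line.endswith("\n"):
                    line += "\n"  # last line may lack a newline
                user = line.split(",", 1)[0]
                # crc32 gives a stable hash, so a user always maps to the same part
                parts[zlib.crc32(user.encode()) % num_parts].write(line)
    finally:
        for f in parts:
            f.close()
    return paths


def process_part(in_path, out_path):
    """Step 2: each thread reads one part and writes to its own output file."""
    totals = {}
    with open(in_path) as f:
        for line in f:
            user, value = line.rstrip("\n").split(",", 1)
            totals[user] = totals.get(user, 0) + int(value)
    with open(out_path, "w") as out:
        for user in sorted(totals):
            out.write(f"{user},{totals[user]}\n")


def merge(paths, dest_path):
    """Step 3: concatenate the per-thread outputs, like cat would."""
    with open(dest_path, "wb") as out:
        for p in paths:
            with open(p, "rb") as f:
                shutil.copyfileobj(f, out)


def run(src_path, dest_path, num_parts=4):
    with tempfile.TemporaryDirectory() as tmp:
        in_paths = split_by_key(src_path, tmp, num_parts)
        out_paths = [p + ".out" for p in in_paths]
        threads = [
            threading.Thread(target=process_part, args=(i, o))
            for i, o in zip(in_paths, out_paths)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()  # main thread waits for all workers before merging
        merge(out_paths, dest_path)
```

Because the split is done by user, no two threads ever touch the same user's data, so no locking is needed anywhere in the sketch.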

Treasure the core dump file

Recently I fixed a memory corruption issue: in an 8-byte memory address, one byte in the middle was set to 0, so the address became invalid, and accessing it crashed the program. This reminded me of another memory corruption issue I had fixed before. In my experience, this kind of memory corruption is very difficult to debug: the adjacent memory is all intact, and only one or a few bytes are changed to other values. These bugs are not obvious out-of-bounds memory accesses, and it is hard to find a way to reproduce them.
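To make this kind of damage concrete, here is a small sketch. The pointer value is made up and the helper names are hypothetical. It zeroes one byte of a 64-bit address and then locates the changed byte by XOR-ing against the expected value, which is the same comparison you can do by hand when a pointer in a core dump looks "almost right".

```python
def zero_byte(address, byte_index):
    """Return a 64-bit address with one byte (0 = least significant) set to 0."""
    mask = 0xFF << (8 * byte_index)
    return address & ~mask & 0xFFFFFFFFFFFFFFFF


def differing_bytes(a, b):
    """List the byte positions (0 = least significant) where two 64-bit values differ."""
    diff = a ^ b
    return [i for i in range(8) if (diff >> (8 * i)) & 0xFF]


good = 0x00007F3A5C1D2E40   # made-up pointer value, for illustration only
bad = zero_byte(good, 3)    # simulate the corruption: one middle byte becomes 0
print(f"{good:#018x} -> {bad:#018x}, changed bytes: {differing_bytes(good, bad)}")
```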

Generally speaking, logs can’t always help when memory is randomly corrupted, not to mention that in some situations traces are not available at all. The only thing you get is the core dump file, and you must use it to unearth as much information as possible. E.g., from the program’s perspective, what was the state of the program when it crashed? Apart from the ruined memory, were there other abnormalities? From the system’s perspective, have you inspected all the registers’ values? Are they all valid? If not, which part of the code could have caused that?

So every time you meet a hard-to-reproduce bug, don’t freak out. Just calm down and begin to analyze the core dump file carefully. You become a detective and the core dump file is the crime scene. In reality, you can’t ask the criminal to commit the crime again. Similarly, not every bug can be made to reoccur; you must try your best to find the root cause from the core dump file. In my experience, every tough debugging session makes you understand the program and the system better, so it is a precious learning opportunity.

Treasure the core dump file and enjoy debugging!

Two practical software engineering rules

There are so many huge books that introduce software engineering; in this article, I want to share two practical rules based on my own experience.

(1) No fear of refactoring

As time goes on, refactoring code is inevitable: the original design can’t handle the current situation seamlessly, new features of the programming language can be used to polish existing code, etc. Since refactoring is time-consuming, risky, and costly, many companies are reluctant to do it. Yet refactoring really is beneficial to both the company and its engineers.

For companies: after refactoring, the code should become more reasonable and easier to maintain, and as a consequence adding new features will take much less time and cost. For engineers: refactoring lets you become more familiar with the code logic, try new language features, and practice module design skills; it is a precious opportunity to enrich yourself. So in the long run, refactoring code is actually a win-win. (If the software quality becomes worse, oh boy! Don’t refactor it!)

(2) “Real” peer-to-peer code review

I haven’t experienced pair programming, but I have taken part in many “fake” peer-to-peer code reviews: the reviewer hadn’t read the code beforehand. During the review, the code author had to spell out the intention of the code, and then the reviewer would analyze it on the spot. The reviewer and the code author seemed very busy in the review meeting, but in fact it was inefficient and a total waste of time!

From my viewpoint, there should be two maintainers for any software module, and both should be equally familiar with the code. Whether adding a big new feature or just fixing a small bug, the two maintainers should work through the whole design together in advance. Then, if the task is small, one maintainer can take on all of the work; otherwise they can share it. Since both have taken part in the discussion beforehand, each can review the partner’s code alone. This method avoids the reviewer being misled by the code author, saves time, and finds bugs efficiently. A further benefit for the company is that if one engineer resigns, nothing is lost, because there is always another engineer who is a genuine backup.

Do these two rules seem feasible? Why not give them a shot?

Process large data in external memory

This week, I implemented a small program which handles a data set. The data set is so big that it can’t be stored in main memory.

I first tried to use stxxl. stxxl is an awesome library which mimics the STL and processes data in external memory. But it has many limitations, such as requiring the data type to be a plain old data type. Since my type doesn’t provide a default constructor, stxxl can’t satisfy my need (please refer to this discussion). I also attempted other workarounds, but all failed.

Finally, I used a simple method: open a file, serialize the data set into the file, and treat the file like main memory. Although it is not the most efficient approach, the program is very clear and not prone to bugs. So I decided to use it as a demo and improve it gradually.
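A minimal sketch of this idea might look like the following. It is not the original program: the class name FileBackedArray and the record layout (a 64-bit integer plus a double) are assumptions made for illustration, and it only works for records that serialize to a fixed size. Each element lives at offset index * record size, so reading or writing one element is a seek followed by a small read or write.

```python
import struct


class FileBackedArray:
    """Fixed-size records stored in a file and accessed by index, like an array."""

    # Hypothetical record layout: little-endian int64 id followed by float64 score.
    RECORD = struct.Struct("<qd")

    def __init__(self, path):
        self._f = open(path, "w+b")  # create/truncate, open for reading and writing
        self._count = 0

    def __len__(self):
        return self._count

    def _check(self, index):
        if not 0 <= index < self._count:
            raise IndexError(index)

    def append(self, rec_id, score):
        self._f.seek(self._count * self.RECORD.size)
        self._f.write(self.RECORD.pack(rec_id, score))
        self._count += 1

    def __getitem__(self, index):
        self._check(index)
        self._f.seek(index * self.RECORD.size)
        return self.RECORD.unpack(self._f.read(self.RECORD.size))

    def __setitem__(self, index, value):
        self._check(index)
        self._f.seek(index * self.RECORD.size)
        self._f.write(self.RECORD.pack(*value))

    def close(self):
        self._f.close()
```

Only one record is held in memory at a time, so the size of the data set is limited by disk space rather than by main memory. The price is a system call per access, which is why this is clear but not the most efficient approach.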

Update: Splitting a large file into smaller ones and using multiple threads to handle them is a good idea.

Some tips on using a “pool” in programming

In the past weeks, I have been diving into using CUDA APIs to operate on multiple GPUs. Part of my work is a “memory pool” module which manages allocating and freeing memory on different devices. In this post, I want to share some thoughts about the “pool”, an interesting technique in programming.

The “pool” works like a “cache”: it allocates some resources beforehand; when the application wants one, the “pool” picks an available one for it. One classic example is the “database connection pool”: the “pool” preallocates some TCP connections and keeps them alive, which saves the client from handshaking with the server every time. Another instance is the “memory pool”, which I implemented recently. The “pool” keeps the freed memory and does not really release it to the device. Based on my benchmark test, the application’s performance can improve by 100% in the extreme case. (Caveat: for a multithreaded application, if the locking mechanism of the “pool” becomes the performance bottleneck, try giving every thread its own private “pool”.)
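The caching behavior described above can be sketched in a few lines. This is not my CUDA module: BlockPool and its allocate callback are hypothetical names, and the callback merely stands in for whatever really allocates memory on a device. Freed blocks are kept in per-(device, size) free lists and handed out again on the next matching request.

```python
import threading
from collections import defaultdict


class BlockPool:
    """Keeps released blocks and reuses them instead of allocating again."""

    def __init__(self, allocate):
        # allocate(device_id, size) -> block; a stand-in for the real device allocator
        self._allocate = allocate
        self._free = defaultdict(list)   # (device_id, size) -> list of idle blocks
        self._lock = threading.Lock()    # protects the free lists

    def acquire(self, device_id, size):
        with self._lock:
            idle = self._free[(device_id, size)]
            if idle:
                return idle.pop()        # reuse a cached block, no real allocation
        return self._allocate(device_id, size)

    def release(self, device_id, size, block):
        # The block is not returned to the device; it is only parked for reuse.
        with self._lock:
            self._free[(device_id, size)].append(block)
```

For the per-thread variant mentioned in the caveat, one option is to store a separate BlockPool per thread (for example in a threading.local() object), so the lock is never contended.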

The other use of a “pool” is for debugging. Taking my “memory pool” as an example again: every memory pointer is accompanied by a device ID (which GPU the memory is allocated from) and a memory size. So you can follow the whole life of a memory block clearly by analyzing the trace log. This has saved me from the notorious “an illegal memory access was encountered” error many times in multi-GPU programming.
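As a rough illustration of that bookkeeping (again a sketch with hypothetical names, not the actual module), the tracker below records the device ID and size for every live pointer, writes a trace line on each allocation and free, and rejects a free that names an unknown pointer or the wrong device.

```python
import logging
import threading

log = logging.getLogger("mempool")  # hypothetical logger name


class BlockTracker:
    """Remembers (device_id, size) for each live pointer and logs its lifetime."""

    def __init__(self):
        self._live = {}                 # pointer -> (device_id, size)
        self._lock = threading.Lock()

    def on_alloc(self, ptr, device_id, size):
        with self._lock:
            self._live[ptr] = (device_id, size)
        log.debug("alloc ptr=%#x device=%d size=%d", ptr, device_id, size)

    def on_free(self, ptr, device_id):
        with self._lock:
            info = self._live.get(ptr)
            if info is None:
                raise ValueError(f"free of unknown pointer {ptr:#x}")
            owner, size = info
            if owner != device_id:
                raise ValueError(
                    f"pointer {ptr:#x} belongs to device {owner}, not {device_id}"
                )
            del self._live[ptr]
        log.debug("free  ptr=%#x device=%d size=%d", ptr, device_id, size)
        return size
```

With the logger set to DEBUG, each pointer's alloc and free lines can be matched up in the trace, and a pointer used on the wrong device or freed twice is caught at the pool boundary rather than inside a kernel.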

Last but not least, although this “memory pool” is merely ~100 lines of code, it already uses the following data structures: queue, vector, pair and map. Not to mention the mutex and lock, which are essential to avoid nasty data-race issues. So writing this module is good practice for honing my programming craft.

Enjoy coding!