I came across Taco Bell Programming recently, and I think this article is worth reading for every software engineer. The post describes a scenario that you might consider solving with Hadoop, when `xargs` may actually be a simpler and better choice. This reminds me of a similar experience: last year a client asked me to process a data file with 5 million records. After some investigation, no novel technology was needed; a concise `awk` script (fewer than 10 lines) worked like a charm! What surprised me even more is that `awk` is just a single-threaded program, with no nifty concurrency involved.
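The original script isn't shown here and the client's data is private, so the following is only a minimal sketch of the kind of job `awk` handles easily; the file name `records.csv`, the field layout, and the per-key sum are assumptions for illustration:

```sh
# Sum the amount column (field 3) per customer ID (field 1) in a
# comma-separated file -- file name and layout are hypothetical.
awk -F',' '
    { total[$1] += $3 }                         # accumulate per key
    END { for (k in total) print k, total[k] }  # emit one line per key
' records.csv
```

A single pass like this streams through millions of rows without ever holding the whole file in memory, which is why it scales so well on one thread.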
The IT field never lacks “new” technologies: cloud computing, big data, high concurrency, etc. However, the thinking behind these “fancy” words often dates back to the era when Unix command-line tools were the invaluable treasure. In many cases, picking the right components and gluing them together with a pipeline can satisfy your requirements perfectly. So instead of exhausting yourself chasing state-of-the-art techniques, you may gain more by spending some time reviewing the Unix command-line manuals.
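To illustrate the pipeline idea (the file names and field numbers below are made up): counting the most frequent values of a field takes one line of standard tools, and `xargs -P` even gives you parallelism without a cluster.

```sh
# Top 10 most frequent values of field 2 -- each tool does one job,
# and the pipe glues them together.
cut -d',' -f2 records.csv | sort | uniq -c | sort -rn | head -10

# Poor man's map-reduce: compress every log file on 8 cores in parallel
# (-P is supported by GNU and BSD xargs, though it is not in POSIX).
find . -name '*.log' -print0 | xargs -0 -P 8 -n 1 gzip
```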
BTW, if your data set can be handled by an `awk` script, it should not be called “big data”.