What you need may be “pipeline + Unix commands” only

I came across Taco Bell Programming recently, and I think the article is worth reading for every software engineer. The post describes a scenario where you might consider using Hadoop, but where xargs may actually be a simpler and better choice. This reminds me of a similar experience: last year a client wanted me to process a data file containing 5 million records. After some investigation, it turned out that no novel technology was needed: a concise awk script (fewer than 10 lines) worked like a charm! What surprised me more is that awk is just a single-threaded program, with no nifty concurrency involved.
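The client's script is not reproduced here, but a rough sketch of that kind of single-pass awk job could look like the following. The record layout (CSV lines of the form "key,amount" with integer amounts) and the function name sum_by_key are assumptions made purely for illustration.

```shell
# Hypothetical input: CSV lines of the form "key,amount" (integer amounts assumed).
# sum_by_key reads such lines on stdin and prints one "key total" line per key.
sum_by_key() {
  awk -F',' '
    NF >= 2 { total[$1] += $2 }   # accumulate the amount per key in a single pass
    END     { for (k in total) printf "%s %d\n", k, total[k] }
  ' | sort                        # awk iterates keys in no fixed order, so sort the result
}

# Usage example with a tiny inline sample
printf 'a,1\nb,2\na,3\n' | sum_by_key
# a 4
# b 2
```

awk keeps only one running total per distinct key in memory, which is part of why a single-threaded script can get through millions of records comfortably.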

The IT field never lacks “new” technologies: cloud computing, big data, high concurrency, etc. However, the thinking behind these “fancy” words may date back to the era when Unix arose. Unix command-line tools are an invaluable treasure. In many cases, picking the right components and gluing them together with pipelines can satisfy your requirements perfectly. So you may gain more by spending some time reviewing the Unix command-line manuals than by exhausting yourself chasing state-of-the-art techniques.
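As a small illustration of gluing components together, here is a minimal sketch of a word-frequency pipeline built only from standard tools (tr, sed, sort, uniq, head, awk). The function name top_words and the sample input are made up for this example.

```shell
# top_words N: read text on stdin and print the N most frequent words as "word count".
top_words() {
  tr -cs 'A-Za-z' '\n' |     # turn every run of non-letters into a newline: one word per line
    tr 'A-Z' 'a-z' |         # normalise case
    sed '/^$/d' |            # drop the empty line produced by leading non-letters
    sort | uniq -c |         # group identical words and count them
    sort -rn |               # most frequent first
    head -n "${1:-10}" |     # keep the top N (default 10)
    awk '{ print $2, $1 }'   # tidy the output into "word count"
}

# Usage example
printf 'the cat and the dog and the bird\n' | top_words 2
# the 3
# and 2
```

Each stage does one small job, and any stage can be swapped out or inspected on its own by cutting the pipeline short at that point.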

BTW, if your data set can be handled by an awk script, it should not be called “big data”.

5 thoughts on “What you need may be “pipeline + Unix commands” only”

  1. I recently followed this philosophy to grab some files from one folder to another with a new name and then upload to S3. Basic Unix commands like mv, chmod and the AWS CLI tied together in a shell script did the job. This script would run as a daemon, detecting new files in the source directory and starting the mv/upload processing.

    The problem was that the script failed to run on the first couple of deploys. I usually rely on tests to make sure that this doesn’t happen, so that’s what I did.
    It turns out that a shell script is more difficult to test because it is basically a black box: there is no easy way of importing functions/modules and testing them in isolation. The only alternative available is a high-level integration test. Also, simple tasks such as string replacement (relevant in my case) are difficult to grasp because they work differently from most programming languages. This makes your code hard to maintain.

    Ported the thing to Python, wrote some tests and voilà.

    I still value simplicity as highly as reliability and security, but sometimes a simple shell script leveraging Unix tools can actually make things more complex and unreliable.
