I came across Taco Bell Programming recently and think it is worth reading for every software engineer. The post describes a scenario in which you might reach for Hadoop when xargs would actually be a simpler and better choice. This reminds me of a similar experience: last year a client wanted me to process a data file containing 5 million records. After some investigation, I needed no novel technology: a concise awk script (fewer than 10 lines) worked like a charm! What surprised me more is that awk is just a single-threaded program, with no nifty concurrency involved.
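To give a feel for how little awk code such a job can take, here is a minimal sketch. It is not the script from that project: the comma-separated "customer,amount" layout and the summarize function are assumptions made up for illustration.

```shell
# Hypothetical input: one "customer,amount" record per line.
# summarize reads the named files (or stdin) in a single pass and prints,
# per customer, the record count and the total amount.
summarize() {
    awk -F',' '
        { total[$1] += $2; count[$1]++ }
        END { for (k in total) printf "%s %d %.2f\n", k, count[k], total[k] }
    ' "$@" | sort
}

# Tiny inline sample instead of a real 5-million-record file.
printf 'alice,10.50\nbob,3.00\nalice,4.50\n' | summarize
# prints:
# alice 2 15.00
# bob 1 3.00
```

The script keeps only one small entry per distinct key in memory, which is a large part of why a single-threaded pass over millions of lines can be fast enough.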
The IT field never lacks “new” technologies: cloud computing, big data, high concurrency, etc. However, the thinking behind these “fancy” words may date back to the era when Unix arose. Unix command line tools are an invaluable treasure. In many cases, picking the right components and gluing them together with pipelines can satisfy your requirements perfectly. So you may gain more by spending some time reviewing the Unix command line manuals than by exhaustingly chasing state-of-the-art techniques.
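As an illustration of gluing components together, the sketch below counts matching lines across many files, handling several files at a time. The count_matches function, the *.log naming, and the sample data are hypothetical; the sketch also assumes a find and xargs that support -print0, -0 and -P, as the GNU and BSD versions do.

```shell
# count_matches PATTERN DIR
# find lists the files, xargs runs up to 4 grep processes in parallel
# (one file each), and awk adds up the per-file counts.
count_matches() {
    pattern=$1
    dir=$2
    find "$dir" -type f -name '*.log' -print0 |
        xargs -0 -n 1 -P 4 grep -c -- "$pattern" |
        awk '{ sum += $1 } END { print sum + 0 }'
}

# Small demonstration on throwaway files.
tmp=$(mktemp -d)
printf 'ok\nerror: disk\nerror: net\n' > "$tmp/a.log"
printf 'error: cpu\nok\n' > "$tmp/b.log"
count_matches 'error' "$tmp"    # prints 3
rm -r "$tmp"
```

Each stage does one small job, and the -P option of xargs is what supplies the parallelism without any framework.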
BTW, if your data set can be handled by an awk script, it should not be called “big data”.
Yes, buddy, the simplest tools are the most powerful, and the fastest. Time is money.
All knowledge but no wisdom these days!
Agreed. Well stated.
I recently followed this philosophy to move some files from one folder to another under a new name and then upload them to S3. Basic Unix commands like mv and chmod, plus the AWS CLI, tied together in a shell script did the job. The script would run as a daemon, detecting new files in the source directory and starting the mv/upload processing.
The problem was that the script failed to run during the first couple of deploys. I usually rely on tests to make sure that this doesn’t happen, so that’s what I did.
It turns out that a shell script is more difficult to test because it is basically a black box: there is no way of importing functions or modules and testing them in isolation. The only alternative available is a high-level integration test. Also, simple tasks such as string replacement (relevant in my case) are hard to grasp because they work differently than in other programming languages. This makes your code hard to maintain.
I ported the thing to Python, wrote some tests and voilà.
I still value simplicity as highly as reliability and security, but sometimes a simple shell script leveraging Unix tools can actually make things more complex and unreliable.
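For readers who have not met shell string handling, this is roughly what the renaming step mentioned above can look like with POSIX parameter expansion. It is only a sketch: the new_name helper, the processed- prefix, and the file names are invented here and are not taken from the script in the comment.

```shell
# new_name PATH
# Hypothetical helper: drop the directory, replace the last extension
# with .dat, and add a "processed-" prefix.
new_name() {
    base=${1##*/}      # remove everything up to and including the last slash
    stem=${base%.*}    # remove the shortest trailing ".something"
    printf 'processed-%s.dat\n' "$stem"
}

new_name /data/incoming/report.2020.csv
# prints: processed-report.2020.dat
```

The ## and % operators are compact but look like nothing in most general-purpose languages, which is the readability cost the comment describes.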
@JD Costa ➠ https://www.destroyallsoftware.com/screencasts/catalog/simple-bash-script-testing