Awk crunches massive data; a High-Performance Computing (HPC) script runs hundreds of Awk instances concurrently. A fast, scalable, in-memory solution on a fat machine.
Introduction
This post presents the solution I worked on in 2018 for a Data Challenge organized at work. I tackle the Scientific Publications Mining challenge (no. 4), which consists of five problems, using classic Unix tools together with a modern, scalable HPC scripting tool. The project is hosted on GitHub. About 12 teams entered the contest.
Tools
Software
Awk (gawk v4.0.2) does the bulk of the core processing.
Swift (NOT Apple's Swift), an HPC scripting language developed at Argonne National Laboratory, runs the Awk programs concurrently over the dataset to radically improve performance. Swift uses MPI-based communication to parallelize and synchronize independent tasks.
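To illustrate the pattern, here is a minimal Swift/T sketch of fanning Awk out over data shards. It is not the actual challenge code: the script name count.awk, the shard layout under /dev/shm, and the shard count are hypothetical.

```
import files;
import string;

// Wrap awk as an "app" function: Swift treats each call as an
// independent task that the MPI-backed runtime can place on any
// free core.
app (file out) run_awk (file script, file shard)
{
  "awk" "-f" script shard @stdout=out;
}

// Hypothetical Awk program and shard naming scheme.
file script = input_file("count.awk");

// Iterations of foreach are implicitly parallel, so hundreds of
// awk processes run at once, reading and writing /dev/shm.
foreach i in [0:511]
{
  file shard = input_file(sprintf("/dev/shm/data/part-%i.txt", i));
  file out <sprintf("/dev/shm/out/part-%i.txt", i)>;
  out = run_awk(script, shard);
}
```

Each iteration is a dataflow task: Swift tracks the file dependencies and schedules the independent Awk runs across the cores, so the Awk code itself needs no locking or explicit parallel logic.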
Other Unix tools are used as well: sort, grep, tr, sed, and bash, along with jq, D3, dot/graphviz, and ffmpeg.
Hardware
Fortunately, I had access to a large-memory (24 TB) SGI system with 512 Intel Xeon (2.5 GHz) cores. All the I/O goes through memory (/dev/shm), i.e., the data is read from and written to /dev/shm.