2014-09-27

Another day in my love affair with AWK

I consider myself a C/C++ developer. Right now I am embracing C++11 (I wanted to wait until it was actually well supported by compilers) and I am loving it.

Despite my happy relationship with C/C++ I have maintained a torrid affair with AWK for many years, which has spilled into this blog before:

  • Almost a year ago I concluded that MAWK is freakin' fast and GNU AWK freakin' fast as a snail
  • This past summer I stumbled over a bottleneck in the one-true-AWK, the default for *BSD and Mac OS X

A Matter of Accountability

So far circumstances dictated that either the script or the input data or both had to be kept secret. In this post both will be publicly available. The purpose of this post is to give people the chance to perform their own tests.

Performing the test requires an AWK interpreter, the dbc2c.awk script and the j1939_utf8.dbc input file used in the command under Tests.

The dbc2c.awk script was already part of my first post. It parses Vector DBC (Database CAN) files, an industry standard for describing a set of devices, messages and signals for the real-time bus CAN (one can argue it's only soft real time; it depends). It does the following things:

  1. Parse data from 1 or more input files
  2. Store the data in arrays, use indexes as references to describe relationships
  3. Output the data
    1. Traverse the data structure and store attributes of objects in an array
    2. Read a template
    3. Insert data into the template and print on stdout
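
Step 2 can be illustrated with a minimal sketch. It is not taken from dbc2c.awk: the input format ("msg NAME" and "sig NAME" lines), the array names and the function name relations_demo are all made up for illustration. The point is only that each object gets a number, and a relationship is stored as the number of the object it refers to.

```sh
relations_demo() {
	awk '
	# Hypothetical input: "msg NAME" starts a message, "sig NAME" adds a
	# signal to the most recent message. This is NOT real DBC syntax.
	$1 == "msg" { msg_name[++msgs] = $2 }
	$1 == "sig" {
		sig_name[++sigs] = $2
		sig_msg[sigs] = msgs   # reference to the parent message by index
	}
	END {
		# Follow the stored index to resolve each signal to its message.
		for (s = 1; s <= sigs; s++) {
			print sig_name[s] " -> " msg_name[sig_msg[s]]
		}
	}
	'
}

printf 'msg EEC1\nsig EngineSpeed\nmsg CCVS\nsig WheelSpeed\nsig BrakeSwitch\n' | relations_demo
```

Since AWK has no records or pointers, parallel arrays keyed by the same number stand in for the fields of one object.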

Test Environment

  • The operating system:
    FreeBSD AprilRyan.norad 10.1-BETA2 FreeBSD 10.1-BETA2 #0 r271856: Fri Sep 19 12:55:39 CEST 2014 root@AprilRyan.norad:/usr/obj/S403/amd64/usr/src/sys/S403 amd64
  • The compiler:
    FreeBSD clang version 3.4.1 (tags/RELEASE_34/dot1-final 208032) 20140512
    Target: x86_64-unknown-freebsd10.1
    Thread model: posix
  • CPU: Core i7@2.4GHz (Haswell)
  • NAWK version: awk version 20121220 (FreeBSD)
  • MAWK version: mawk 1.3.4.20140914
  • GNU AWK version: GNU Awk 4.1.1, API: 1.1

Tests

With the recent changeset 219:01114669a8bf, the script switched from array iteration (for (index in array) { … }) to creating a numbered index for each object type and iterating through the objects in order of creation. This ensures that data is output in the same order by every AWK implementation, which makes it much easier to compare and validate outputs from different flavours of AWK.
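
The idea can be sketched as follows. This is a hypothetical example, not code from dbc2c.awk; the names ordered_demo, obj and count are assumptions made for illustration.

```sh
ordered_demo() {
	awk '
	{
		# Store each input line under the next sequential number
		# instead of using the line itself as the array key.
		obj[count++] = $0
	}
	END {
		# Counting from 0 up to count - 1 replays the lines in creation
		# order; "for (i in obj)" would visit them in an order that
		# depends on the AWK implementation.
		for (i = 0; i < count; i++) {
			print i ": " obj[i]
		}
	}
	'
}

printf 'sig_b\nsig_a\nsig_c\n' | ordered_demo
```

The price is one extra counter per object type, in exchange for output that can be compared byte for byte across interpreters.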

To reproduce the tests, run:

time -l awk -f scripts/dbc2c.awk -vDATE=whenever j1939_utf8.dbc | sha256

The checksum for the output should read:

9f0a105ed06ecac710c20d863d6adefa9e1154e9d3a01c681547ce1bd30890df

Here are my runtime results:

  • NAWK: 6.23 s, 6.32 s, 6.27 s
  • GNU AWK: 11.79 s, 11.88 s, 11.80 s
  • MAWK: 1.98 s, 2.02 s, 1.97 s

Memory usage (maximum resident set size):

  • NAWK: 22000 k
  • GNU AWK: 50688 k
  • MAWK: 26644 k

Conclusion

Once again the usual order of things establishes itself. GNU AWK wastes our time and memory while MAWK takes the winner's crown and NAWK keeps to the middle ground.

The dbc2c.awk script has been tested before and GNU AWK actually performs much better this time: 6.0 instead of 9.6 times slower than MAWK. Maybe parsing just one file instead of 3 helps, or the input data produces fewer collisions for the hashing algorithm (AWK array indexes are always cast to string and stored in hash tables).
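
The string conversion of array indexes is easy to observe. The following is a small sketch with made-up names (key_demo, a, n); it shows only the visible behaviour, not how any particular implementation hashes its keys.

```sh
key_demo() {
	awk 'BEGIN {
		a[1] = "one"          # numeric subscript, stored under the string "1"
		a["1"] = "uno"        # same element, so this overwrites it
		a["01"] = "distinct"  # a different string, hence a second element
		n = 0
		for (k in a) n++      # count the elements actually stored
		print n, a[1], a["01"]
	}'
}

key_demo
```

The array ends up with two elements rather than three, because 1 and "1" name the same entry.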

In any case, I'd love to see some more benchmarks out there, and maybe someone will bring their favourite flavour of AWK to the table.