Abrazolica

Home Archive Tags About RSS

Converting a Spaces-Separated Data File to CSV with awk

posted on: Monday, March 5th, 2012, 14:08

Transforming data files can often be done in a powerful text editor like emacs. But in the age of 'big data' it's sometimes better to do the transforming on the command line to avoid loading the whole thing into memory. Below is an example. This particular file wasn't too large to load into emacs, but it's a good example for how to avoid that.

In downloading a daily weather data file for Boulder Colorado recently, we found it had 8 fields:

year, month, day, max_temp, min_temp, precip, snowfall, snowdepth

But the fields were separated by multiple varying numbers of spaces as shown here:

2011 12 20     42     16    0.01  -999.0      2
2011 12 21     39     22 -999.00  -999.0      2
2011 12 22     33     15    0.93    14.6     16
2011 12 23     44      5    0.00     0.0     14
2011 12 24     47     16    0.00     0.0     12
2011 12 25     46     21    0.00     0.0     11
2011 12 26     53     20    0.00     0.0      9
2011 12 27     55     21    0.00     0.0      9
2011 12 28     59     31    0.00     0.0      7
2011 12 29     56     43    0.00     0.0      4
2011 12 30     61     38    0.00     0.0      3
2011 12 31     59     33 -999.00  -999.0      2
2012  1  1     35     19    0.00     0.0      2
2012  1  2     53     19    0.00     0.0      2
2012  1  3     57     29    0.00     0.0      1
2012  1  4     61     24    0.00     0.0   -999
2012  1  5     69     39    0.00     0.0   -999
2012  1  6     60     31    0.00     0.0   -999
2012  1  7     39     21    0.08     0.8      1
2012  1  8     39     16    0.04     0.6      1
2012  1  9     52     22    0.00     0.0   -999
2012  1 10     61     29    0.00     0.0   -999
2012  1 11     46     16    0.16     4.7      5
2012  1 12     44      2 -999.00  -999.0      3
2012  1 13     46     23    0.00     0.0   -999
2012  1 14     59     31    0.00     0.0   -999
2012  1 15     55     43    0.00     0.0   -999
2012  1 16     53     21    0.01     0.1   -999
2012  1 17     32      6    0.04     0.9      1
2012  1 18     49     20 -999.00  -999.0   -999
2012  1 19     64     39    0.00     0.0   -999
2012  1 20     58     42    0.00     0.0      0

We wanted to convert it from varying space separated values to comma separated values (csv).

First, we had to eliminate a 1 line header and 14 line footer, saving the result as boulderdaily.noheader like this:

tail -n +2 boulderdaily.complete | head -n -14 > boulderdaily.noheader

Now, this space separated data file was converted to a csv file using awk with the following command:

awk '{printf("%d,%d,%d,%d,%d,%d,%d,%d", $1, $2, $3, $4, $5, $6, $7, $8)}' boulderdaily.noheader > boulderdaily.csv

So the new csv file looks like this:

2011,12,20,42,16,0,-999,2
2011,12,21,39,22,-999,-999,2
2011,12,22,33,15,0,14,16
2011,12,23,44,5,0,0,14
2011,12,24,47,16,0,0,12
2011,12,25,46,21,0,0,11
2011,12,26,53,20,0,0,9
2011,12,27,55,21,0,0,9
2011,12,28,59,31,0,0,7
2011,12,29,56,43,0,0,4
2011,12,30,61,38,0,0,3
2011,12,31,59,33,-999,-999,2
2012,1,1,35,19,0,0,2
2012,1,2,53,19,0,0,2
2012,1,3,57,29,0,0,1
2012,1,4,61,24,0,0,-999
2012,1,5,69,39,0,0,-999
2012,1,6,60,31,0,0,-999
2012,1,7,39,21,0,0,1
2012,1,8,39,16,0,0,1
2012,1,9,52,22,0,0,-999
2012,1,10,61,29,0,0,-999
2012,1,11,46,16,0,4,5
2012,1,12,44,2,-999,-999,3
2012,1,13,46,23,0,0,-999
2012,1,14,59,31,0,0,-999
2012,1,15,55,43,0,0,-999
2012,1,16,53,21,0,0,-999
2012,1,17,32,6,0,0,1
2012,1,18,49,20,-999,-999,-999
2012,1,19,64,39,0,0,-999
2012,1,20,58,42,0,0,0

Note that we didn't have to tell awk that the field delimiter was spaces. That's assumed by default.

This more uniform and smaller csv file is more easily dealt with by other applications like gnuplot. A side benefit is that converting multiple spaces to commas cut the file size almost in half.

This example used GNU Awk 4.0.0 on Linux.