Abrazolica
Home     Archive     Tags     About        RSS
Transforming data files can often be done in a powerful text editor like emacs. But in the age of 'big data' it's sometimes better to do the transforming on the command line to avoid loading the whole thing into memory. Below is an example. This particular file wasn't too large to load into emacs, but it's a good example for how to avoid that.
In downloading a daily weather data file for Boulder Colorado recently, we found it had 8 fields:
year, month, day, max_temp, min_temp, precip, snowfall, snowdepth
But the fields were separated by multiple varying numbers of spaces as shown here:
2011 12 20 42 16 0.01 -999.0 2
2011 12 21 39 22 -999.00 -999.0 2
2011 12 22 33 15 0.93 14.6 16
2011 12 23 44 5 0.00 0.0 14
2011 12 24 47 16 0.00 0.0 12
2011 12 25 46 21 0.00 0.0 11
2011 12 26 53 20 0.00 0.0 9
2011 12 27 55 21 0.00 0.0 9
2011 12 28 59 31 0.00 0.0 7
2011 12 29 56 43 0.00 0.0 4
2011 12 30 61 38 0.00 0.0 3
2011 12 31 59 33 -999.00 -999.0 2
2012 1 1 35 19 0.00 0.0 2
2012 1 2 53 19 0.00 0.0 2
2012 1 3 57 29 0.00 0.0 1
2012 1 4 61 24 0.00 0.0 -999
2012 1 5 69 39 0.00 0.0 -999
2012 1 6 60 31 0.00 0.0 -999
2012 1 7 39 21 0.08 0.8 1
2012 1 8 39 16 0.04 0.6 1
2012 1 9 52 22 0.00 0.0 -999
2012 1 10 61 29 0.00 0.0 -999
2012 1 11 46 16 0.16 4.7 5
2012 1 12 44 2 -999.00 -999.0 3
2012 1 13 46 23 0.00 0.0 -999
2012 1 14 59 31 0.00 0.0 -999
2012 1 15 55 43 0.00 0.0 -999
2012 1 16 53 21 0.01 0.1 -999
2012 1 17 32 6 0.04 0.9 1
2012 1 18 49 20 -999.00 -999.0 -999
2012 1 19 64 39 0.00 0.0 -999
2012 1 20 58 42 0.00 0.0 0
We wanted to convert it from varying space separated values to comma separated values (csv).
First, we had to eliminate a 1 line header and 14 line footer, saving the result as boulderdaily.noheader like this:
tail -n +2 boulderdaily.complete | head -n -14 > boulderdaily.noheader
Now, this space separated data file was converted to a csv file using awk with the following command:
awk '{printf("%d,%d,%d,%d,%d,%d,%d,%d", $1, $2, $3, $4, $5, $6, $7, $8)}' boulderdaily.noheader > boulderdaily.csv
So the new csv file looks like this:
2011,12,20,42,16,0,-999,2
2011,12,21,39,22,-999,-999,2
2011,12,22,33,15,0,14,16
2011,12,23,44,5,0,0,14
2011,12,24,47,16,0,0,12
2011,12,25,46,21,0,0,11
2011,12,26,53,20,0,0,9
2011,12,27,55,21,0,0,9
2011,12,28,59,31,0,0,7
2011,12,29,56,43,0,0,4
2011,12,30,61,38,0,0,3
2011,12,31,59,33,-999,-999,2
2012,1,1,35,19,0,0,2
2012,1,2,53,19,0,0,2
2012,1,3,57,29,0,0,1
2012,1,4,61,24,0,0,-999
2012,1,5,69,39,0,0,-999
2012,1,6,60,31,0,0,-999
2012,1,7,39,21,0,0,1
2012,1,8,39,16,0,0,1
2012,1,9,52,22,0,0,-999
2012,1,10,61,29,0,0,-999
2012,1,11,46,16,0,4,5
2012,1,12,44,2,-999,-999,3
2012,1,13,46,23,0,0,-999
2012,1,14,59,31,0,0,-999
2012,1,15,55,43,0,0,-999
2012,1,16,53,21,0,0,-999
2012,1,17,32,6,0,0,1
2012,1,18,49,20,-999,-999,-999
2012,1,19,64,39,0,0,-999
2012,1,20,58,42,0,0,0
Note that we didn't have to tell awk that the field delimiter was spaces. That's assumed by default.
This more uniform and smaller csv file is more easily dealt with by other applications like gnuplot. A side benefit is that converting multiple spaces to commas cut the file size almost in half.
This example used GNU Awk 4.0.0 on Linux.
© 2010-2012 Stefan Hollos and Richard Hollos
blog comments powered by Disqus