Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


Which AWK program can do this manipulation?

Given a file whose contents are arranged like the following (with fields separated by spaces or tabs):

4 5 6 2 9 8 4 8
m d 6 7 9 5 4 g
t 7 4 2 4 2 5 3
h 5 6 2 5 s 3 4
r 5 7 1 2 2 4 1
4 1 9 0 5 6 d f
x c a 2 3 4 5 9
0 0 3 2 1 4 q w

Which AWK program do I need to get the following output?

  4 5
  m d
  t 7
  h 5
  r 5
  4 1
  x c
  0 0
  6 2
  6 7
  4 2
  6 2
  7 1
  9 0
  a 2
  3 2
  9 8
  9 5
  4 2
  5 s
  2 2
  5 6
  3 4
  1 4
  4 8
  4 g
  5 3
  3 4
  4 1
  d f
  5 9
  q w

Thanks in advance for any and all help.

Postscript

Please bear in mind,

  1. My input file is much larger than the one depicted in this question.

  2. My computer science skills are seriously limited.

  3. This task has been imposed on me.



1 Reply

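# n is the number of fields per output line; the sample output in the question corresponds to n=2
awk -v n=4 '
    # concatenate fields start..end, separated by FS and terminated by ORS
    function join(start, end,    result, i) {
        for (i=start; i<=end; i++)
            result = result $i (i==end ? ORS : FS)
        return result
    }
    {
        # append each group of n fields on this line to the text for its column group
        c=0
        for (i=1; i<=NF; i+=n) {    # i<=NF so a final partial group is not skipped
            c++
            col[c] = col[c] join(i, i+n-1)
        }
    }
    END {
        for (i=1; i<=c; i++)
            printf "%s", col[i]  # the value already ends with newline
    }
' file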

The info page has a short primer on awk, so read that too.
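
The sample output in the question groups the fields in pairs, which corresponds to n=2 rather than the n=4 passed in the command above. As a minimal sketch, the following wraps the same approach in a shell function and runs it on the first two rows of the sample. The function name `columnize` is hypothetical, and the loop bound here is `i <= NF`.

```shell
# columnize N: reads stdin and prints groups of N fields, one column group
# after another. Hypothetical wrapper around the approach shown in the answer.
columnize() {
    awk -v n="$1" '
        function join(start, end,    result, i) {
            for (i = start; i <= end; i++)
                result = result $i (i == end ? ORS : FS)
            return result
        }
        {
            c = 0
            for (i = 1; i <= NF; i += n) {
                c++
                col[c] = col[c] join(i, i + n - 1)
            }
        }
        END {
            for (i = 1; i <= c; i++)
                printf "%s", col[i]
        }
    '
}

# First two rows of the sample from the question, two fields per group.
printf '4 5 6 2 9 8 4 8\nm d 6 7 9 5 4 g\n' | columnize 2
```

With these two rows the pairs come out as `4 5`, `m d`, then `6 2`, `6 7`, and so on, which is the same ordering as the output requested in the question.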


Benchmarking

  1. create an input file with roughly 1,000,000 columns (2**20 = 1,048,576) and 8 rows (as specified by OP)

    #!perl
    my $cols = 2**20; # 1,048,576
    my $rows = 8;
    my @alphabet=( 'a'..'z', 0..9 );
    my $size = scalar @alphabet;
    
    for (my $r = 1; $r <= $rows; $r++) {
        for (my $c = 1; $c <= $cols; $c++) {
            my $idx = int rand $size;
            printf "%s ", $alphabet[$idx];
        }
        printf "\n";
    }
    
    $ perl createfile.pl > input.file
    $ wc input.file
           8  8388608 16777224 input.file
    
  2. time various implementations: I use the fish shell, so the timing output differs from bash's

    • my awk

      $ time awk -f columnize.awk -v n=4 input.file > output.file
      
      ________________________________________________________
      Executed in    3.62 secs   fish           external
         usr time    3.49 secs    0.24 millis    3.49 secs
         sys time    0.11 secs    1.96 millis    0.11 secs
      
      $ wc output.file
       2097152  8388608 16777216 output.file
      
    • Timur's perl:

      $ time perl -lan columnize.pl input.file > output.file
      
      ________________________________________________________
      Executed in    3.25 secs   fish           external
         usr time    2.97 secs    0.16 millis    2.97 secs
         sys time    0.27 secs    2.87 millis    0.27 secs
      
    • Ravinder's awk

      $ time awk -f columnize.ravinder input.file > output.file
      
      ________________________________________________________
      Executed in    4.01 secs   fish           external
         usr time    3.84 secs    0.18 millis    3.84 secs
         sys time    0.15 secs    3.75 millis    0.14 secs
      
    • kvantour's awk, first version

      $ time awk -f columnize.kvantour -v n=4 input.file > output.file
      
      ________________________________________________________
      Executed in    3.84 secs   fish           external
         usr time    3.71 secs  166.00 micros    3.71 secs
         sys time    0.11 secs  1326.00 micros    0.11 secs
      
    • kvantour's second awk version: interrupted with Ctrl-C after a few minutes

      $ time awk -f columnize.kvantour2 -v n=4 input.file > output.file
      ^C
      ________________________________________________________
      Executed in  260.80 secs   fish           external
         usr time  257.39 secs    0.13 millis  257.39 secs
         sys time    1.68 secs    2.72 millis    1.67 secs
      
      $ wc output.file
       9728 38912 77824 output.file
      

      The $0=a[j] assignment is pretty expensive, as awk has to re-split the string into fields each time.

    • dawg's python

      $ timeout 60s fish -c 'time python3 columnize.py input.file 4 > output.file'
      [... 60 seconds later ...]
      $ wc output.file
       2049  8196 16392 output.file
      
  3. another interesting data point: using different awk implementations. I'm on a Mac, with GNU awk and mawk installed via Homebrew

    • with many columns, few rows

      $ time gawk -f columnize.awk -v n=4 input.file > output.file
      
      ________________________________________________________
      Executed in    3.78 secs   fish           external
         usr time    3.62 secs  174.00 micros    3.62 secs
         sys time    0.13 secs  1259.00 micros    0.13 secs
      
      $ time /usr/bin/awk -f columnize.awk -v n=4 input.file > output.file
      
      ________________________________________________________
      Executed in   17.73 secs   fish           external
         usr time   14.95 secs    0.20 millis   14.95 secs
         sys time    2.72 secs    3.45 millis    2.71 secs
      
      $ time mawk -f columnize.awk -v n=4 input.file > output.file
      
      ________________________________________________________
      Executed in    2.01 secs   fish           external
         usr time  1892.31 millis    0.11 millis  1892.21 millis
         sys time   95.14 millis    2.17 millis   92.97 millis
      
    • with many rows, few columns: this test took over half an hour on a MacBook Pro (6-core Intel CPU, 16 GB RAM)

      $ time mawk -f columnize.awk -v n=4 input.file > output.file
      
      ________________________________________________________
      Executed in   32.30 mins   fish           external
         usr time   23.58 mins    0.15 millis   23.58 mins
         sys time    8.63 mins    2.52 millis    8.63 mins
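
The awk program timed above keeps every output group in memory until END. As a hedged alternative (not part of the answer, and not benchmarked here), the sketch below reads the file once per column group and prints each group straight away, so memory use stays small at the cost of re-reading the input. The function name `columnize_multipass` is hypothetical, and the sketch assumes a regular file in which every line has the same number of fields as the first.

```shell
# columnize_multipass N FILE: hypothetical helper, one pass over FILE per
# group of N columns. Assumes all lines have as many fields as line 1.
columnize_multipass() {
    n=$1
    file=$2
    # Field count of the first line (0 if the file is empty).
    nf=$(awk 'NR == 1 { print NF; exit }' "$file")
    nf=${nf:-0}
    start=1
    while [ "$start" -le "$nf" ]; do
        # Print fields start .. start+n-1 of every line, space separated.
        awk -v s="$start" -v n="$n" '{
            line = $s
            for (i = s + 1; i < s + n && i <= NF; i++)
                line = line FS $i
            print line
        }' "$file"
        start=$((start + n))
    done
}
```

Usage would look like `columnize_multipass 4 input.file > output.file`. Whether this is actually faster than the single-pass program for a given input is an open question; it only removes the need to hold the whole output in memory.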
      
