awk average multiple columns

If you have some output lined up in columns, you can use awk to average each column. Here’s some sample output (from a NetApp “toaster> stats show -p flexscale-access”):

# cat sample.txt
    73   5480      0   1040  84     0     0      0     0      0     0      0       541
    73   6038     39   1119  84     0     0      0     0      0     0      0       475
    73   5018     19    859  85     0     0      0     0      0     0      0       348
    73   5960     20   1480  80   120     0    320     0      0     0      0       427
    73   6098      0   1019  85     0     0      0     0      0     0      0       486
    73   5220      0   1220  81     0     0      0     0      0     0      0       288
    73   5758     79   1319  81    59    39    319     0      0     0      0       500
    73   4419      0   2039  68     0     0      0     0      0     0      0       279
    73   5400      0    840  86     0     0      0     0      0     0      0       382
    73   5238      0   1299  80     0     0      0     0      0     0      0       389
    73   5449      0   1696  76    59     0    199     0      0     0      0       340
    73   5478      0   1419  79     0     0      0     0      0     0      0       414
    73   5020     20   1000  83     0     0      0     0      0     0      0       405
    73   4359      0   1059  80     0     0      0     0      0     0      0       295
    73   5838     39   1139  83     0    19      0     0      0     0      0       494
    73   6100     40   1720  78     0     0      0     0      0     0      0       480
    73   5398     19   1239  81     0     0      0     0      0     0      0       398
    73   5089     79   1097  82     0     0      0     0      0     0      0       459
    73   6178     19   1159  84     0    39    159     0      0     0      0       487
    73   4999      0   1239  80     0     0      0     0      0     0      0       345
    73   4820      0    880  84     0     0      0     0      0     0      0       339
    73   5467      0   1177  82     0     0      0     0      0     0      0       413
    73   4700     60   1480  76     0     0      0     0      0     0      0       337
#

And the column averages:

# cat sample.txt | awk '{for (i=1;i<=NF;i++){a[i]+=$i;}} END {for (i=1;i<=NF;i++){printf "%.0f", a[i]/NR; printf "\t"};printf "\n"}'
73      5371    19      1241    81      10      4       43      0       0       0       0       405
#

Here awk loops through each field in a row, and adds the value to an array (a[i]) keyed by the field number. Then at the end, it takes each column’s total, divides by the number of rows (NR), and prints the result without decimals (%.0f). It separates each field with a tab (\t) and prints a newline (\n) after the last field.

You could make it print totals, as well as averages. You could also make it print out the original data, or a field header to know what each column represents...
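For example, here’s a sketch (against the same sample.txt) that prints a totals row followed by the averages row:

# awk '{for (i=1;i<=NF;i++){a[i]+=$i;}} END {for (i=1;i<=NF;i++){printf "%.0f\t", a[i]};printf "\n";for (i=1;i<=NF;i++){printf "%.0f\t", a[i]/NR};printf "\n"}' sample.txt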

get directory mtime in unix time

In scripts, when you need to compare the last modification times of directories, you can get the mtime using stat as a Unix timestamp, or seconds from the Epoch. The format sequence for mtime is %Y (%Z would give you ctime, the time of last status change, instead):

# stat -c '%Y' /usr/local/sbin
1373673278

Using date you can get the same format like this:

# date +%s
1373673486

You could use this in a script to do something if a directory is older or newer than some amount of time:

#!/bin/bash
# FILE: sync_usr_local_sbin.sh
# AUTHOR: ForDoDone <fordodone at email.com>
# DATE: 2013-07-12
# NOTES: syncs /usr/local/sbin to hostxyz if it's been modified in the last 5 minutes
#

now=`date +%s`

uls_lastmtime=`stat -c '%Y' /usr/local/sbin`

uls_diff=$(echo $now - $uls_lastmtime |bc)

if [ $uls_diff -lt 300 ]
then
  rsync -a /usr/local/sbin/ hostxyz:/usr/local/sbin
fi

Of course rsync has a bunch of options to check whether it needs to do an update of files, this is just an example.
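For instance, instead of gating the whole sync on the directory mtime, you could let rsync decide per file with -u (--update), which skips any file that is already newer on the receiving side. A rough sketch of the same sync:

rsync -au /usr/local/sbin/ hostxyz:/usr/local/sbin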

SCP files from one host to another

Everyone knows how to copy files around using SCP, but it can be a pain if you have to enter passwords for every copy. If you have an administration host with shared ssh keys to every other host, you can use a quick little one-liner to drag files from hostA, through the admin box, over to hostB:

adminbox # ssh hostA "tar cf - /usr/local/sbin/myscript.sh 2>/dev/null" | ssh hostB "cd / && tar xvf - 1>/dev/null"

Using tar, the file is output to STDOUT and piped over ssh, then read from STDIN. It copies /usr/local/sbin/myscript.sh from hostA to hostB. Because the admin box has ssh keys to both hostA and hostB, the process is automatic and does not require password authentication. This means you can use this method in scripts for batch copies, etc. Also, you won’t have to create a temporary copy on the admin host.

Drop it into a simple shell script and it will be even easier:

#!/bin/bash
# FILE: file_dragger.sh
# AUTHOR: fordodone <fordodone at email.com>
# DATE: 2013/07/11
# NOTES: drags a file from one host to another
#

if [ $# -ne 3 ]
then
  echo ""
  echo "usage: </full/path/to/file> <src> <dst>"
  echo ""
  exit 1
fi

ssh $2 "tar cf - $1 2>/dev/null" | ssh $3 "cd / && tar xvf - 1>/dev/null"

To use it for the original copy example do this:

# file_dragger.sh /usr/local/sbin/myscript.sh hostA hostB
#
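If the file being dragged is large, you could also compress the stream in transit with tar’s z flag; a sketch of the same copy, gzipped over both hops:

# ssh hostA "tar czf - /usr/local/sbin/myscript.sh 2>/dev/null" | ssh hostB "cd / && tar xzf - 1>/dev/null"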

delete files with unrecognized characters

For whatever reason you may find some files with unrecognized or mis-encoded characters that need to be removed. Because the terminal doesn’t recognize the characters, it’s difficult to do anything with them.

# ls -l
-rw-r--r-- 1 www-data www-data 14828193 Nov 26  2008 ?¡?ú?©?ç?}?¤?I?@áÁ?????�?ï?i?r?g?j?j [51] 2008.10.02 ?w?¼?e?Ɋm?F?µ?Ă݂܂·?I?x - HirataTalk +AB Quiz.wma
-rw-r--r-- 1 www-data www-data 14568695 Nov 26  2008 ?V?g?ƈ«???̎·???J?t?F?ւ悤?±?» [01] 2007.08.31 - ?ò?é?݂䂫.wma
-rw-r--r-- 1 www-data www-data 11898139 Nov 26  2008 ?V?g?ƈ«???̎·???J?t?F?ւ悤?±?» [02] 2007.09.07 - kukui.wma
-rw-r--r-- 1 www-data www-data 11642799 Nov 26  2008 ?V?g?ƈ«???̎·???J?t?F?ւ悤?±?» [03] 2007.09.14 - ?ێu???ê?N.wma
#

Use the -i flag with ls to obtain the inode number of the files:

# ls -li
6886578 -rw-r--r-- 1 www-data www-data 14828193 Nov 26  2008 ?¡?ú?©?ç?}?¤?I?@áÁ?????�?ï?i?r?g?j?j [51] 2008.10.02 ?w?¼?e?Ɋm?F?µ?Ă݂܂·?I?x - HirataTalk +AB Quiz.wma
6886580 -rw-r--r-- 1 www-data www-data 14568695 Nov 26  2008 ?V?g?ƈ«???̎·???J?t?F?ւ悤?±?» [01] 2007.08.31 - ?ò?é?݂䂫.wma
6886581 -rw-r--r-- 1 www-data www-data 11898139 Nov 26  2008 ?V?g?ƈ«???̎·???J?t?F?ւ悤?±?» [02] 2007.09.07 - kukui.wma
6886582 -rw-r--r-- 1 www-data www-data 11642799 Nov 26  2008 ?V?g?ƈ«???̎·???J?t?F?ւ悤?±?» [03] 2007.09.14 - ?ێu???ê?N.wma
#

Now use find with the -inum flag to find only the file with a specific inode number. Then delete it:

# find . -inum 6886578 -delete
#
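If you would rather rescue a file than delete it, the same -inum trick works with -exec; here renamed_file.wma is just a made-up target name:

# find . -inum 6886580 -exec mv {} renamed_file.wma \;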

diff command outputs, not files

You can easily diff the output of commands instead of files. In this case hexdump prints thousands of lines, but I’m only interested in the difference:

# diff <(hexdump file1.bin) <(hexdump file2.bin)
1,2c1,2
< 0000000 6a49 b610 0000 0000 5733 7261 4465 4243
< 0000010 0000 0000 0001 0000 9006 4e0b 0b28 000f
---
> 0000000 6a49 b616 0000 0000 5733 7261 4465 4243
> 0000010 0000 0000 0001 0000 9006 4e11 0b28 000f

Each hexdump runs in a subshell via process substitution, and its output is handed to diff as if it were a file. I’m only interested in the 2 pieces that are different for each binary file:

# for i in `ls *.bin | sort -nk1.7`; do echo -n "$i: "; hexdump -C $i | grep '33 57 61 72 65 44\|4e 28 0b 0f 00' | awk '{if(NR==1) print $4;if(NR==2) print $12}' | paste - -; done | column -t 2>/dev/null
file0.bin:   1a  15
file1.bin:   19  14
file2.bin:   18  13
file3.bin:   17  12
file4.bin:   16  11
file5.bin:   15  10
file6.bin:   14  0f
file8.bin:   12  0d
file9.bin:   11  0c
file10.bin:  10  0b
file12.bin:  0e  09
file13.bin:  0d  08
file14.bin:  0f  0a
file15.bin:  0b  06
file16.bin:  0a  05
file17.bin:  09  04
file18.bin:  08  03
file19.bin:  07  02
file20.bin:  06  01
file21.bin:  05  00
file22.bin:  0c  07
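Process substitution isn’t limited to hexdump; any pair of commands works. For example, a quick sketch comparing the same config file across two hosts (reusing the hostA/hostB names from above):

# diff <(ssh hostA cat /etc/resolv.conf) <(ssh hostB cat /etc/resolv.conf)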

show progress for dd

By default dd is silent. It just copies whatever blocks you want from in to out. In order to see progress, send it a USR1 signal using kill.

Start a useless dd:

# dd if=/dev/zero of=/dev/null

In another terminal find the pid:

# ps aux | grep dd | grep -v grep
root      7784 90.5  0.0   2884   560 pts/9    R+   10:01   0:06 dd if=/dev/zero of=/dev/null
#
# kill -USR1 7784

The original window will now show this:

# dd if=/dev/zero of=/dev/null
14501614+0 records in
14501614+0 records out
7424826368 bytes (7.4 GB) copied, 16.2149 seconds, 458 MB/s

Then you can ctrl+c it to get the final output:

# dd if=/dev/zero of=/dev/null
14501614+0 records in
14501614+0 records out
7424826368 bytes (7.4 GB) copied, 16.2149 seconds, 458 MB/s
16888077+0 records in
16888076+0 records out
8646694912 bytes (8.6 GB) copied, 19.3507 seconds, 447 MB/s

This one-liner will start your dd, then monitor it and output progress every 20 seconds. Once the dd is finished, it will stop and give your shell back.

dd if=/dev/zero of=/dev/null & pid=$! && sleep 20s && while true; do i=`ps aux | awk '{print $2}' | grep ^$pid$`; if [ "${i:-a}" !=  "$pid" ]; then break; fi; kill -USR1 $pid; sleep 20s; done;
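Recent versions of GNU dd (coreutils 8.24 and later) can also report progress on their own, with no signals needed:

# dd if=/dev/zero of=/dev/null status=progress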

printing large integers with awk

When printing large numbers, awk falls back to scientific notation by default. Take this snippet from an example file. The first column is a count of how many times a file is present, the second column is the md5sum of that file, and the third is the size of the file in bytes.

# tail -3 md5sums
  14737 113136892f2137aa0116093a524ade0b        53
  19402 1c7b413c3fa39d0fed40556d2658ac73        44
  52818 b7f10e862d0e82f77a86b522159ce3c8        45
#

If I wanted to sum up the number of files counted in this file, and how much total space they are all taking up, I do this:

# awk '{i=i+$1;j=j+($3*$1);} END {print i; print j}' md5sums
22412000
1.45255e+13

So awk counted 22412000 files, totaling about 14.5 TB. Let’s make that a little more readable:

# awk '{i=i+$1;j=j+($3*$1);} END {printf ("%d\n", i); printf("%d\n", j)}' md5sums
22412000
2147483647

Um… that’s not right. But 2147483647 is a special number. You should recognize it as the maximum value of a 32-bit signed integer, or (2^31)-1. In this case printf’s %d can’t represent anything larger, so the total gets clamped. Instead, use print, but tell awk what the output format should look like:

awk 'BEGIN {OFMT = "%.0f"} {i=i+$1;j=j+($3*$1);} END {print i; print j}' md5sums 
22412000
14525468874034
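Alternatively, printf works fine if you give it a floating point conversion instead of %d; %.0f goes through a double, which holds integers exactly up to 2^53:

# awk '{i=i+$1;j=j+($3*$1);} END {printf("%.0f\n", i); printf("%.0f\n", j)}' md5sums
22412000
14525468874034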

find exec with grep pipe

If you have to search 62,000 log files for a specific string, what’s the best way to do it? This will not work:

# zgrep string www1*/apache2/fordodone.com/201*/*/*/*error*.log.gz

Because the shell expands the glob into an argument list that is too long to pass to a single zgrep, the command fails with “Argument list too long”.

Instead, use find to generate the list of logfiles. You could redirect the list to a file, then run a for loop on each one, but we can just use -exec with find to run commands on the log files as we find them. This is nice, because you can process the files, and have output as it chugs along. Either of these works:

# find www1*/apache2/fordodone.com/201*/*/*/ -name '*error*.log.gz' -exec zgrep string {} \;

# find www1*/apache2/fordodone.com/201*/*/*/ -name '*error*.log.gz' -exec sh -c 'zgrep string $0' {} \;

In my head it sounds something like this: “find the files in the matching directories, that are named like ‘*error*.log.gz’, and as you find them, execute a command on them. The command is a new shell command to zgrep for the string in the file you just found.”

The first one works fine, BUT if you need to pipe your zgrep output to some other command, you need to execute a subshell for that.

## do sed substitution after
-exec sh -c 'zgrep string $0 | sed -e "s/A/B/g"' {} \;

## read backwards and find first (aka last) occurrence
-exec sh -c 'zcat $0 | tac | grep -m1 string' {} \;

Always use single quotes for the subshell command sh -c, because you don’t want the current shell to interpret it, but pass the $0 as a literal so that the subshell can interpret it. The $0 in the subshell refers to the FIRST argument it is passed, which in this case is {}, or the file that find has currently found.
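One more variation worth knowing: if you end -exec with + instead of \;, find batches many filenames into each invocation, which avoids forking a zgrep per file (the sh -c pipe tricks above still need the \; form, since $0 only picks up the first argument):

# find www1*/apache2/fordodone.com/201*/*/*/ -name '*error*.log.gz' -exec zgrep string {} +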

remove many empty directories

# find . -depth -mindepth 1 -maxdepth 3 -type d -exec rmdir {} \;

This finds directories between 1 and 3 levels deep and attempts to remove them. The -depth flag processes the deepest child directories before their parents. This is great, because it tries to remove foo/bar/ before it tries to remove foo/. Without removing foo/bar/ first, rmdir foo/ would fail. Because rmdir will fail if there are any contents in a directory, the operation is safe to run without removing any files. You could redirect STDERR to a file, and capture all the directories that are not empty for processing later.
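For example, something like this (not_empty.txt is just an arbitrary name) leaves you a list of every directory rmdir refused to remove:

# find . -depth -mindepth 1 -maxdepth 3 -type d -exec rmdir {} \; 2> not_empty.txt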

merge directories with rsync

rsync -a --ignore-existing --remove-source-files src/ dest

Any existing files in the destination will not be overwritten. After it’s done, whatever remains in src is exactly what also exists in the destination; diff the copies to decide which ones to keep manually, or quickly write a one-liner to compare time stamps, keeping newer ones and overwriting older versions.
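For the keep-newer-versions case, rsync’s -u (--update) flag may be all the one-liner you need; it overwrites a destination file only when the source copy is newer, and skips (and therefore keeps in src) everything else:

rsync -au --remove-source-files src/ dest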