diff command outputs, not files

You can easily diff the output of commands instead of files. In this case hexdump prints thousands of lines, but I’m only interested in the difference:

# diff <(hexdump file1.bin) <(hexdump file2.bin)
1,2c1,2
< 0000000 6a49 b610 0000 0000 5733 7261 4465 4243
< 0000010 0000 0000 0001 0000 9006 4e0b 0b28 000f
---
> 0000000 6a49 b616 0000 0000 5733 7261 4465 4243
> 0000010 0000 0000 0001 0000 9006 4e11 0b28 000f

Run the hexdump in subshell using parenthesis, then redirect the output back to diff. I’m only interested in the 2 pieces that are different for each binary file:

# for i in `ls *.bin | sort -nk1.7`; do echo -n "$i: "; hexdump -C $i | grep '33 57 61 72 65 44\|4e 28 0b 0f 00' | awk '{if(NR==1) print $4;if(NR==2) print $12}' | paste - -; done | column -t 2>/dev/null
file0.bin:   1a  15
file1.bin:   19  14
file2.bin:   18  13
file3.bin:   17  12
file4.bin:   16  11
file5.bin:   15  10
file6.bin:   14  0f
file8.bin:   12  0d
file9.bin:   11  0c
file10.bin:  10  0b
file12.bin:  0e  09
file13.bin:  0d  08
file14.bin:  0f  0a
file15.bin:  0b  06
file16.bin:  0a  05
file17.bin:  09  04
file18.bin:  08  03
file19.bin:  07  02
file20.bin:  06  01
file21.bin:  05  00
file22.bin:  0c  07

show progress for dd

By default dd is silent. It just copies whatever blocks you want from in to out. In order to see progress, send it a USR1 signal using kill.

Start a useless dd:

# dd if=/dev/zero of=/dev/null

In another terminal find the pid:

# ps aux | grep dd | grep -v grep
root      7784 90.5  0.0   2884   560 pts/9    R+   10:01   0:06 dd if /dev/zero of /dev/null
#
# kill -USR1 7784

The original window will now show this:

# dd if=/dev/zero of=/dev/null
14501614+0 records in
14501614+0 records out
7424826368 bytes (7.4 GB) copied, 16.2149 seconds, 458 MB/s

Then you can ctrl+c it to get the final output:

# dd if=/dev/zero of=/dev/null
14501614+0 records in
14501614+0 records out
7424826368 bytes (7.4 GB) copied, 16.2149 seconds, 458 MB/s
16888077+0 records in
16888076+0 records out
8646694912 bytes (8.6 GB) copied, 19.3507 seconds, 447 MB/s

This one liner will start your dd, then monitor it and output progress every 20 seconds. Once the dd is finished it will stop and give your shell back.

dd if=/dev/zero of=/dev/null & pid=$! && sleep 20s && while true; do i=`ps aux | awk '{print $2}' | grep ^$pid$`; if [ "${i:-a}" !=  "$pid" ]; then break; fi; kill -USR1 $pid; sleep 20s; done;

printing large integers with awk

When printing with awk, it uses scientific notation by default. Take this snippet from an example file. The first column is a count of how many times a file is present, the second column is the md5sum of that file and the third is the number of bytes that the file is.

# tail -3 md5sums
  14737 113136892f2137aa0116093a524ade0b        53
  19402 1c7b413c3fa39d0fed40556d2658ac73        44
  52818 b7f10e862d0e82f77a86b522159ce3c8        45
#

If I wanted to sum up the number of files counted in this file, and how much total space they are all taking up, I do this:

# awk '{i=i+$1;j=j+($3*$1);} END {print i; print j}' md5sums
22412000
1.45255e+13

So awk counted 22412000 files, totaling about 14.5 TB. Let’s make that a little more readable:

# awk '{i=i+$1;j=j+($3*$1);} END {printf ("%d\n", i); printf("%d\n", j)}' md5sums
22412000
2147483647

Um… that’s not right. But 2147483647 is a special number. You should recognize it as the maximum value of a 32 bit unsigned integer or ((2^32)/2)-1. In this case printf doesn’t handle large integers at all. Instead, use print, but tell awk what the output format should look like:

awk 'BEGIN {OFMT = "%.0f"} {i=i+$1;j=j+($3*$1);} END {print i; print j}' md5sums 
22412000
14525468874034

find duplicate entry in sql dump

Recently, I tried to import a SQL dump created by mysqldump that somehow had a duplicate entry for a primary key. Here’s a sample of the contents:

INSERT INTO `table1` VALUES ('B97bKm',71029594,3,NULL,NULL,'2013-01-22 09:25:39'),('dZfUHQ',804776,1,NULL,NULL,'2012-09-05 16:15:23'),('hWkGsz',70198487,0,NULL,NULL,'2013-01-05 10:55:36'),('n6366s',69480146,1,NULL,NULL,'2012-
12-18 03:27:45'),('tBP6Ug',65100805,1,NULL,NULL,'2012-08-29 21:32:39'),('yfpewZ',18724906,0,NULL,NULL,'2013-03-31 17:12:58'),('UNz5qp',8392940,2,NULL,NULL,'2012-11-28 02:00:00'),('9WVpVV',71181566,0,NULL,NULL,'2013-01-25 06:15:03'),('kEP
Qu5',64972980,9,NULL,NULL,'2012-09-01 06:00:36')

It goes on for another 270,000 entries. I was able to find the duplicate value like this:

# cat /tmp/table1.sql | grep INSERT | sed -e 's/),/\n/g' | sed -e 's/VALUES /\n/' | grep -v INSERT | awk -F, '{print $2}' | sort | uniq -c | awk '{if($1>1) print;}'
    2 64590015
#

The primary key value 64590015 had 2 entries. I removed the spurious entry, and subsequently the SQL imported fine.

dhcpd lease information sorted by date and time

When looking on a pxe boot install server, you can see what the newest clients were to boot. If you don’t have KVM access on new servers to be installed, just look at the newest lease info, and make an educated guess about which new one to login to the auto-installer environment (preseed) via ssh.

Here’s a snippet from the leases file:

lease 10.101.40.85 {
  starts 3 2013/05/15 19:54:36;
  ends 3 2013/05/15 20:54:36;
  cltt 3 2013/05/15 19:54:36;
  binding state active;
  next binding state free;
  hardware ethernet 00:30:48:5c:cf:34;
  uid "\001\0000H\\\3174";
}

And after some parsing:

# cat /var/lib/dhcp/dhcpd.leases | grep -e lease -e hardware -e start -e end | grep -v format | grep -v written | sed -e '/start/s/\// /g' -e 's/;//g' -e '/starts/s/:/ /g' | paste - - - - | awk '{print $2" "$18" "$6" "$7" "$8" "$9" "$10" "$11" "$14" "$15}' | sort -k 3,3n -k 4,4n -k 5,5n -k 6,6n -k 7,7n -k 8,8n | awk '{print $1" "$2" "$3"/"$4"/"$5" "$6":"$7":"$8" "$9" "$10}' | column -t


10.101.40.127  00:1e:68:9a:e5:ac  2013/04/26  22:02:58  2013/04/26  23:02:58
10.101.40.129  00:1e:68:9a:e5:ac  2013/04/26  23:10:01  2013/04/27  00:10:01
10.101.40.122  00:1e:68:9a:e5:ac  2013/04/26  23:27:57  2013/04/26  23:30:42
10.101.40.118  00:1e:68:9a:ee:69  2013/05/14  16:21:28  2013/05/14  17:21:28
10.101.40.85   00:30:48:5c:cf:34  2013/05/14  16:54:43  2013/05/14  17:54:43
10.101.40.118  00:1e:68:9a:ee:69  2013/05/14  17:14:04  2013/05/14  18:14:04
10.101.40.85   00:30:48:5c:cf:34  2013/05/14  17:24:43  2013/05/14  18:24:43
10.101.40.85   00:30:48:5c:cf:34  2013/05/14  17:54:42  2013/05/14  18:54:42
#

rsync migration with manifest of transfer

I once had a migration project to move 40TB of data that needed to be moved from source to destination NFS volumes. Naturally, I went with rsync. This was basically the command for the initial transfers:

rsync -a --out-format="transfer:%t,%b,%f" --itemize-changes /mnt/srcvol /mnt/destvol >> /log/file.log

Pretty simple right? The logs are a simple csv file that looked like this:

transfer:2013/05/02 10:16:13,35291,mnt/srcvol/archive/foo/bar/barfoo/IMAGES/1256562131100.jpg

The customer asked for daily updates on progress. I said no problem, and this one liner takes care of it:

# grep transfer /log/file.log | awk -F "," '{if ($2!=0) i=i+$2; x++} END {print "total Gbytes: "i/1073741824"\ntotal files: "x}'
total Gbytes: 1153.29
total files: 123686

From the rsync command above, the %t means timestamp (2013/05/02 10:16:13), the %b means bytes transferred (35291), and the %f means the whole file path. By adding up the %b column of output and counting how many times you added it, you get both the total bytes transferred and the total number of files transferred. Directories show up as 0 byte transfers so in awk we don’t count them. Also, I threw in the divide by 1073741824 (1024*1024*1024), which converts bytes to Gebibytes.

I ended up putting it in a shell script and adding options such as, just find transfers for a particular day/hour, better handling for the Gbytes number, rate calculation, and the ability to add logs from multiple data moving servers.

print last 3 characters of a string

If you need to get the last few characters of a string, you can just use awk.  This works great because you don’t have to know how long or short the string is to begin with. It first gets the length of the string. So in this example the length is 9 characters, and the ‘1’ would be in position 9 of the string. Then it subtracts 2 from position 9, to get position 7 (which is the ‘3’), then it gets 3 characters starting from position 7: ‘371’

# echo server371 | awk '{print substr($0,length($0)-2,3)}'
371

# echo anotherserver892 | awk '{print substr($0,length($0)-2,3)}'
892