find fastest Ubuntu mirror with netselect

netselect can tell you which mirror is “best” for you to use for downloading packages, or for setting up your own mirror. It uses ICMP to determine latency and hop count between you and each mirror, and combines them into a single score. It can take a list of many mirrors, test them all, and report which one has the best (lowest) score. If you want rsync or ftp as a preferred transport, you could filter the mirror list to only include one of those. Throw in a few verbose flags to get more output.

# netselect -s 20 `wget https://launchpad.net/ubuntu/+archivemirrors -q -O - | grep '>http' |cut -d / -f 3 | tr '\n' ' '`
    3 mirror.tcpdiag.net
   14 149.20.4.71
   17 nz.archive.ubuntu.com
   17 ftp.citylink.co.nz
   17 mirrors.easynews.com
   18 mirrors.nl.eu.kernel.org
   18 ubuntu.securedservers.com
   45 mirrors.cat.pdx.edu
   58 mirror.peer1.net
   67 mirror.pnl.gov
   77 76.73.4.58
   90 ubuntu.mirrors.tds.net
   95 mirror.steadfast.net
  100 ubuntu-archives.mirror.nexicom.net
  102 mirrors.gigenet.com
  105 mirrors.xmission.com
  109 ubuntu.mirror.constant.com
  115 mirror.cs.umn.edu
  117 ubuntu.bhs.mirrors.ovh.net
  120 mirrors.rit.edu

In this case it looks like mirror.tcpdiag.net is the best choice.

# ping -c 3 mirror.tcpdiag.net
PING mirror.tcpdiag.net (69.160.243.150) 56(84) bytes of data.
64 bytes from ip-69-160-243-150.static.atlanticmetro.net (69.160.243.150): icmp_req=1 ttl=59 time=3.11 ms
64 bytes from ip-69-160-243-150.static.atlanticmetro.net (69.160.243.150): icmp_req=2 ttl=59 time=2.85 ms
64 bytes from ip-69-160-243-150.static.atlanticmetro.net (69.160.243.150): icmp_req=3 ttl=59 time=3.27 ms

--- mirror.tcpdiag.net ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 2.852/3.081/3.275/0.185 ms

3ms is pretty close.
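
If you would rather rank rsync or ftp mirrors, the same pipeline works with a different filter — a sketch, assuming the Launchpad page marks ftp mirrors with ftp:// links the same way it marks http ones:

# netselect -s 20 `wget https://launchpad.net/ubuntu/+archivemirrors -q -O - | grep '>ftp' | cut -d / -f 3 | tr '\n' ' '`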

diff command outputs, not files

You can easily diff the output of commands instead of files. In this case hexdump prints thousands of lines, but I’m only interested in the difference:

# diff <(hexdump file1.bin) <(hexdump file2.bin)
1,2c1,2
< 0000000 6a49 b610 0000 0000 5733 7261 4465 4243
< 0000010 0000 0000 0001 0000 9006 4e0b 0b28 000f
---
> 0000000 6a49 b616 0000 0000 5733 7261 4465 4243
> 0000010 0000 0000 0001 0000 9006 4e11 0b28 000f

The <( ) construct is process substitution: each hexdump runs in a subshell, and its output is presented to diff as if it were a file. Here I’m only interested in the 2 pieces that are different in each binary file:

# for i in `ls *.bin | sort -nk1.7`; do echo -n "$i: "; hexdump -C $i | grep '33 57 61 72 65 44\|4e 28 0b 0f 00' | awk '{if(NR==1) print $4;if(NR==2) print $12}' | paste - -; done | column -t 2>/dev/null
file0.bin:   1a  15
file1.bin:   19  14
file2.bin:   18  13
file3.bin:   17  12
file4.bin:   16  11
file5.bin:   15  10
file6.bin:   14  0f
file8.bin:   12  0d
file9.bin:   11  0c
file10.bin:  10  0b
file12.bin:  0e  09
file13.bin:  0d  08
file14.bin:  0f  0a
file15.bin:  0b  06
file16.bin:  0a  05
file17.bin:  09  04
file18.bin:  08  03
file19.bin:  07  02
file20.bin:  06  01
file21.bin:  05  00
file22.bin:  0c  07
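
Nothing here is specific to hexdump; any two command outputs can be compared the same way. A trivial example (filenames hypothetical):

# diff <(sort list1.txt) <(sort list2.txt)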

show progress for dd

By default dd is silent. It just copies whatever blocks you want from in to out. In order to see progress, send it a USR1 signal using kill.

Start a useless dd:

# dd if=/dev/zero of=/dev/null

In another terminal find the pid:

# ps aux | grep dd | grep -v grep
root      7784 90.5  0.0   2884   560 pts/9    R+   10:01   0:06 dd if /dev/zero of /dev/null
#
# kill -USR1 7784
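
As an aside, pkill can do the find-and-signal in one step. Note that this signals every process named exactly dd, so only use it when a single dd is running:

# pkill -USR1 -x dd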

The original window will now show this:

# dd if=/dev/zero of=/dev/null
14501614+0 records in
14501614+0 records out
7424826368 bytes (7.4 GB) copied, 16.2149 seconds, 458 MB/s

Then you can ctrl+c it to get the final output:

# dd if=/dev/zero of=/dev/null
14501614+0 records in
14501614+0 records out
7424826368 bytes (7.4 GB) copied, 16.2149 seconds, 458 MB/s
16888077+0 records in
16888076+0 records out
8646694912 bytes (8.6 GB) copied, 19.3507 seconds, 447 MB/s

This one-liner will start your dd, then monitor it and output progress every 20 seconds. Once the dd is finished, it will stop and give you your shell back.

dd if=/dev/zero of=/dev/null & pid=$!; sleep 20; while kill -USR1 $pid 2>/dev/null; do sleep 20; done
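
If your dd comes from GNU coreutils 8.24 or newer, none of this is necessary; it can report progress on its own:

# dd if=/dev/zero of=/dev/null status=progress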

get memcache statistics

To obtain some stats about a memcached process, use nc to talk directly to it (substitute your memcached server for HOST):

# echo "stats" | nc -w 1 HOST 11211
STAT pid 1750
STAT uptime 29481383
STAT time 1369781775
STAT version 1.4.5
STAT pointer_size 64
STAT rusage_user 6974.991909
STAT rusage_system 13000.624488
STAT curr_connections 10
STAT total_connections 132871296
STAT connection_structures 1674
STAT cmd_get 227296759
STAT cmd_set 113549712
STAT cmd_flush 0
STAT get_hits 221783239
STAT get_misses 5513520
STAT delete_misses 0
STAT delete_hits 0
STAT incr_misses 36444
STAT incr_hits 19304751
STAT decr_misses 3
STAT decr_hits 19367598
STAT cas_misses 0
STAT cas_hits 0
STAT cas_badval 0
STAT auth_cmds 0
STAT auth_errors 0
STAT bytes_read 26222582271
STAT bytes_written 25905663432
STAT limit_maxbytes 67108864
STAT accepting_conns 1
STAT listen_disabled_num 0
STAT threads 4
STAT conn_yields 0
STAT bytes 27010093
STAT curr_items 366610
STAT total_items 113554333
STAT evictions 0
STAT reclaimed 5291352
END

#
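
All of this is easy to slice with awk. For example, a quick sketch that computes the get hit rate from the counters above (HOST as before):

# echo "stats" | nc -w 1 HOST 11211 | awk '/STAT (get_hits|cmd_get) / {s[$2]=$3} END {printf("hit rate: %.2f%%\n", 100*s["get_hits"]/s["cmd_get"])}'
hit rate: 97.57%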

printing large integers with awk

When printing large numbers, awk falls back to scientific notation by default (its default output format, OFMT, is %.6g). Take this snippet from an example file. The first column is a count of how many times a file is present, the second column is the md5sum of that file, and the third is the file’s size in bytes.

# tail -3 md5sums
  14737 113136892f2137aa0116093a524ade0b        53
  19402 1c7b413c3fa39d0fed40556d2658ac73        44
  52818 b7f10e862d0e82f77a86b522159ce3c8        45
#

To sum up the number of files counted in this file, and how much total space they are all taking up, I do this:

# awk '{i=i+$1;j=j+($3*$1);} END {print i; print j}' md5sums
22412000
1.45255e+13

So awk counted 22412000 files, totaling about 14.5 TB. Let’s make that a little more readable:

# awk '{i=i+$1;j=j+($3*$1);} END {printf ("%d\n", i); printf("%d\n", j)}' md5sums
22412000
2147483647

Um… that’s not right. But 2147483647 is a special number. You should recognize it as the maximum value of a signed 32-bit integer, (2^31)-1. This awk’s printf clamps anything larger to that maximum when using the %d integer conversion, so it can’t print large integers that way. Instead, use print, but tell awk what the output format should look like:

# awk 'BEGIN {OFMT = "%.0f"} {i=i+$1;j=j+($3*$1);} END {print i; print j}' md5sums
22412000
14525468874034
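
Using printf with a floating-point conversion instead of %d should work as well, since only the integer conversion clamps at 32 bits (same file assumed):

# awk '{i=i+$1;j=j+($3*$1);} END {printf("%.0f\n", i); printf("%.0f\n", j)}' md5sums
22412000
14525468874034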

find duplicate entry in sql dump

Recently, I tried to import a SQL dump created by mysqldump that somehow had a duplicate entry for a primary key. Here’s a sample of the contents:

INSERT INTO `table1` VALUES ('B97bKm',71029594,3,NULL,NULL,'2013-01-22 09:25:39'),('dZfUHQ',804776,1,NULL,NULL,'2012-09-05 16:15:23'),('hWkGsz',70198487,0,NULL,NULL,'2013-01-05 10:55:36'),('n6366s',69480146,1,NULL,NULL,'2012-12-18 03:27:45'),('tBP6Ug',65100805,1,NULL,NULL,'2012-08-29 21:32:39'),('yfpewZ',18724906,0,NULL,NULL,'2013-03-31 17:12:58'),('UNz5qp',8392940,2,NULL,NULL,'2012-11-28 02:00:00'),('9WVpVV',71181566,0,NULL,NULL,'2013-01-25 06:15:03'),('kEPQu5',64972980,9,NULL,NULL,'2012-09-01 06:00:36')

It goes on for another 270,000 entries. I was able to find the duplicate value like this:

# cat /tmp/table1.sql | grep INSERT | sed -e 's/),/\n/g' | sed -e 's/VALUES /\n/' | grep -v INSERT | awk -F, '{print $2}' | sort | uniq -c | awk '{if($1>1) print;}'
    2 64590015
#

The primary key value 64590015 had 2 entries. I removed the spurious entry, and subsequently the SQL imported fine.
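
Since uniq -d prints only the duplicated lines, the counting step can be dropped if all you want is the offending key:

# grep INSERT /tmp/table1.sql | sed -e 's/),/\n/g' | sed -e 's/VALUES /\n/' | grep -v INSERT | awk -F, '{print $2}' | sort | uniq -d
64590015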

find exec with grep pipe

If you have to search 62,000 log files for a specific string what’s the best way to do it? This will not work:

# zgrep string www1*/apache2/fordodone.com/201*/*/*/*error*.log.gz

Because the shell expands the glob into one argument per file, the command blows past the kernel’s argument length limit and fails with “Argument list too long”.
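
The limit comes from the kernel; on Linux you can check how many bytes of arguments and environment a single exec is allowed:

# getconf ARG_MAX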

Instead, use find to build the list of logfiles. You could redirect that list to a file and then run a for loop over each one, but we can just use -exec with find to run commands on the log files as we find them. This is nice, because the files are processed and output appears as the job chugs along. Either of these works:

# find www1*/apache2/fordodone.com/201*/*/*/ -name '*error*.log.gz' -exec zgrep string {} \;

# find www1*/apache2/fordodone.com/201*/*/*/ -name '*error*.log.gz' -exec sh -c 'zgrep string $0' {} \;

In my head it sounds something like this: “find the files in the matching directories, that are named like ‘*error*.log.gz’, and as you find them, execute a command on them. The command is a new shell command to zgrep for the string in the file you just found.”

The first one works fine, BUT if you need to pipe your zgrep (or whatever) to some other command, you need the second form, which executes a subshell for that:

## do sed substitution after
-exec sh -c 'zgrep string $0 | sed -e "s/A/B/g"' {} \;

## read backwards and find first (aka last) occurrence
-exec sh -c 'zcat $0 | tac | grep -m1 string' {} \;

Always use single quotes for the subshell command passed to sh -c, because you don’t want the current shell to interpret it; the $0 should be passed as a literal so that the subshell can interpret it. The $0 in the subshell refers to the FIRST argument it is passed, which in this case is {}, the file that find has currently found.
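
xargs is another way around the argument limit: it reads the filenames from find on stdin and batches them into as many zgrep invocations as necessary:

# find www1*/apache2/fordodone.com/201*/*/*/ -name '*error*.log.gz' -print0 | xargs -0 zgrep string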

remove many empty directories

# find . -depth -mindepth 1 -maxdepth 3 -type d -exec rmdir {} \;

This finds directories, between 1 and 3 levels deep and attempts to remove them. The -depth flag finds the deepest child directories, before finding parents. This is great, because it tries to remove foo/bar/ before it will try to remove foo/. Without removing foo/bar/ first, rmdir foo/ would fail. Because rmdir will fail if there are any contents in a directory, the operation is safe to run without removing any files. You could redirect STDERR to a file, and capture all the directories that are not empty for processing later.