percentile apache server request response times

I needed a hack to quickly find the 95th percentile of apache request response times. For example I needed to be able to say that “95% of our apache requests are served in X milliseconds or less.” In the apache2 config the LogFormat directive had %D (the time taken to serve the request, in microseconds) as the last field. Meaning the last field of each log line would be the time it took to serve the request. This would make it easy to pull out with $NF in awk

# PCT=.95; NR=`cat access.log | wc -l `; cat /var/log/apache2/access.log | awk '{print $NF}' | sort -rn | tail -n+$(echo "$NR-($NR*$PCT)" |bc | cut -d. -f1) |head -1
938247

In this case 95% of the apache requests were served in 938 milliseconds or less (WTF?!). Then run on an aggregated group of logs, or change the date/time range to just run for logs on a particular day, or for multiple time periods.

Note: I couldn’t get scale to work here in bc for some reason.

IRQBALANCE_BANNED_CPUS explained and working on Ubuntu 14.04

Despite the explanation and examples in the manual, there is still some confusion on how to use irqbalance to move interrupts to a specific set of CPUs. (For the sake of simplicity, I use the words CPUs in the same way Linux sees them; it doesn’t care whether they are physical, or cores, or threads.)

Let’s start with a box that has 8 CPUs

# grep processor /proc/cpuinfo 
processor       : 0
processor       : 1
processor       : 2
processor       : 3
processor       : 4
processor       : 5
processor       : 6
processor       : 7

irqbalance uses a variable set at runtime ( IRQBALANCE_BANNED_CPUS ) to make decisions which CPUs should not receive interrupts. IRQBALANCE_BANNED_CPUS is a hexidecimal representation of a “bitmask” to ignore certain CPUs when deciding which CPU to use for a specific interrupt. If we have a box with 8 cpus we will need to start with a binary string 8 digits long: “00000000”

0 means “allow interrupts on this CPU”
1 means “prevent interrupts on this CPU”

In the case of “00000000” we are allowing interrupts on all 8 CPUs. In binary numbers the least significant bit is on the right, mapping the CPUs like this:
“CPU7 CPU6 CPU5 CPU4 CPU3 CPU2 CPU1 CPU0″

Let’s say we want to allow interrupts on the 6th and 7th CPU, and exclude the 1st, 2nd, 3rd, 4th, 5th, and 8th CPU from interrupts. Our bitmask in this case would look like this”

   1    0    0    1    1    1    1    1
"CPU7 CPU6 CPU5 CPU4 CPU3 CPU2 CPU1 CPU0"

This can get tricky. Most humans start counting from 1 but computers start counting at 0. So be careful: the 7th CPU is CPU6 in this case.

Now that we have our bitmask of “10011111” we need to convert it to hexadecimal. You could use this handy bash function:

#
bin2hex() {
   echo "obase=16;ibase=2;$1"|bc
}
# bin2hex 10011111
9F
#

So now we can edit /etc/default/irqbalance and set the variable:

export IRQBALANCE_BANNED_CPUS="9F" 

NOTE: I have been able to get irqbalance working on Ubuntu 14.04 trusty only by exporting the IRQBALANCE_BANNED_CPUS variable using “export” in /etc/default/irqbalance AND using the latest (at the time of this writing) irqbalance 1.0.9

Once you understand how the CPUs translate to the bitmask you can extrapolate out from there and come up with a way to ban specific CPUs from irqbalance for any number of CPUs.

monitor NetApp SnapMirror transfer speed

You may want to monitor the speed of a current snapmirror to get an idea of how fast the transfer is going. The speed might change throughout the day due to load, or disk bottleneck, etc. I started with this one-liner:

i=0;j=0; while true; do j=$i; i=`ssh toaster01 "snapmirror status -l volname" | grep Progress | awk '{print $2}'| cut -d \( -f2`; if [ $j -eq 0 ]; then sleep 1m; continue; fi; echo -n "$i-$j = ";echo "scale=2;($i-$j)/(1024*1024)" | bc | tr '\n' ' '; echo "GB/min"; sleep 1m; done;

Which lead to this short script:

#!/bin/bash
# FILE: netapp_sm_monitor.sh
# AUTHOR: For Do Done <fordodone@fordodone.com>
# DATE: 2014/03/26
# NOTES: 
# 

if [ $# -lt 2 ]
then
  echo ""
  echo "usage: netapp_sm_monitor.sh <filer> <srcvol> [-v]"
  echo ""
  exit
fi

i=0;
j=0; 

while true; 
do 
  j=$i; 
  i=`ssh $1 "snapmirror status -l $2" | grep Progress | awk '{print $2}'| cut -d \( -f2`; 
  if [ $j -eq 0 ]; 
    then 
    sleep 1m; 
    continue; 
  fi; 
  if [ "$3" == "-v" ]
  then
    echo -n "$i-$j = ";
  fi
  echo "scale=2;($i-$j)/(1024*1024)" | bc | tr '\n' ' '; echo "GB/min"; 
  sleep 1m; 
done;

convert log time seconds to readable date

[1122633.028643] end_request: I/O error, dev sdc, sector 0

When looking in logs, like dmesg, error messages are preceded by a number that represents the uptime on the server in seconds at the time of the error. So this I/O error happened 1122633 seconds after the machine booted. This means nothing to us. In order to see when the error happened, you need to convert the seconds of uptime into a readable date.

First get the date/time at which the server booted using who -b and convert to seconds. Then add the seconds of uptime from the error message, and then convert back to a human readable date:

# date --date="@$(echo $(date --date="`who -b | awk '{print $3" "$4}'`" +%s)+1122633|bc)"
# Tue Oct 22 00:03:33 PDT 2013

So this error happened shortly after midnight. Very interesting…

NetApp decode acp domain option

How does this option function to set a network? The acp.domain option is a convoluted decimal representation of the network portion of the IP address used for acp.

toaster*> options acp
acp.domain 65193
acp.enabled on
acp.netmask 65535
acp.port e0f

Take 65193 and convert it to binary: 1111111010101001. Then split it up into two (or more) octets: 11111110 10101001. Then convert each of the octets back to decimal: 254 169. Then reverse the order: 169 254. That is the acp network. The netmask portion is more straightforward. In this case our ACP network is 169.254/16.

You could hack a quick little one liner:

# for i in `echo "obase=2;65193" |bc | awk 'BEGIN{FS=""} {for(i=1;i<33;i++){printf $i; if(i==8)printf " ";}printf "\n"}'`; do echo "ibase=2;$i" |bc; done|tac | paste - - | sed 's/\t/./'
169.254
#

get directory mtime in unix time

In scripts when you need to compare last modification date of directories, you can get the date using stat in a unix timestamp or seconds from the Epoch:

# stat -c '%Z' /usr/local/sbin
1373673278

Using date you can get the same format like this:

# date +%s
1373673486

You could use this in a script to do something if a directory is older or newer than some amount of time:

#!/bin/bash
# FILE: sync_usr_local_sbin.sh
# AUTHOR: ForDoDone <fordodone at email.com>
# DATE: 2013-07-12
# NOTES: syncs /usr/local/sbin to hostxyz if it's been modified in the last 5 minutes
#

now=`date +%s`

uls_lastmtime=`stat -c '%Z' /usr/local/sbin`

uls_diff=$(echo $now - $uls_lastmtime |bc)

if [ $uls_diff -lt 300 ]
then
  rsync -a /usr/local/sbin/ hostxyz:/usr/local/sbin
fi

Of course rsync has a bunch of options to check whether it needs to do an update of files, this is just an example.