count character occurrence rates in filenames

find all the files in a directory. Take out the first dot . put in by find. Remove slashes (can’t be a character in a filename). Use fold -w 1 (–width) the width option limits column output to 1 character, which puts each character on it’s own line. Don’t count spaces (we don’t care about them). Sort the output, count how many occurrences of each character happened. Sort output by least to most occurrences of characters.

find . -type f | sed -e 's/\.//' -e 's/\// /g' | fold -w 1 | grep -v '^ $' | sort | uniq -c | sort -nk1
      1 '
      7 ^
     22 ,
     29 (
     29 )
     40 #
     51 =
     72 ~
    214 @
    312 :
    672 Y
   1141 +
   1217 J
   1497 Z
   2813 G
   3696 U
   3727 H
   5168 O
   5654 N
   5700 X
   5721 K
  10185 R
  10590 W
  11414 F
  12412 A
  13114 E
  13424 C
  13904 z
  15369 Q
  15698 j
  18746 I
  20582 S
  30232 M
  39547 q
  44301 B
  44946 P
  54675 7
  74749 9
  74777 L
  78077 T
  83720 8
  86739 D
  87151 4
  92824 k
  93168 y
  94261 5
  96495 w
 105734 V
 135527 6
 193306 f
 215943 0
 239003 g
 274810 3
 284082 v
 291777 1
 305769 h
 329499 _
 353852 2
 397075 b
 493086 m
 513388 p
 523439 d
 539160 x
 654812 -
 697485 l
 717868 a
 728134 n
 843460 t
 862742 u
 883640 .
1059771 i
1060749 c
1109991 o
1227620 r
1326244 s
1440326 e

find music directories

I was recently handed an old Windows laptop, and told “It is broken so I know you can put it to use, and if you get my music off of it that would be awesome.” Right away I knew I had a great chance of recovering everything from the hard drive.

I took the hard drive out of the laptop and plugged it into my workstation via a SATA to USB converter. It showed right up and I mounted the partition that I thought would be the windows partition:

# ls
autoexec.bat  config.sys  doctemp                 found.000  found.003     MSOCache     pagefile.sys  ProgramData       Program Files              Users
Boot          DELL        Documents and Settings  found.001  hiberfil.sys  newfile.enc  pending.un    ProgramData.LOG1  $Recycle.Bin               Windows
bootmgr       dell.sdr    Drivers                 found.002  Intel         newkey       PerfLogs      ProgramData.LOG2  System Volume Information

Well, that looks familiar. Then went into the person’s user directory and did this:

# find . -type f -name '*.m4a' -o -name '*.mp3' -ls > ~/music_file_list

I could have been more thorough and looked for more file extensions (acc,m4u, etc.), but I figured iTunes would just put every music file in the same folder. The resultant file looked like this:

# tail music_file_list
 82837 9632 -rw-------   2 fordodone fordodone  9861791 Sep 16  2008 ./Users/laptop/Music/iTunes/Alanis\ Morissette\ -\ Jagged\ Little\ Pi\ 12.mp3
   307 27696 -rw-------   2 fordodone fordodone 28357414 Oct 22  2008 ./Users/laptop/Searches/Documents/3L\ First\ Semester/Energy/dem\ now.mp3
 53814 4856 -rw-------   2 fordodone fordodone  4972361 Feb 17  2007 ./Users/laptop/Searches/Documents/Old\ Computer/My\ Music/01\ Bouncing\ Around\ The\ Room.mp3
 53817 6116 -rw-------   2 fordodone fordodone  6259086 Feb 17  2007 ./Users/laptop/Searches/Documents/Old\ Computer/My\ Music/01\ Come\ Together.mp3
 53834 8132 -rw-------   2 fordodone fordodone  8325962 Feb 17  2007 ./Users/laptop/Searches/Documents/Old\ Computer/My\ Music/01\ Funky\ Bitch.mp3
 53962 31512 -rw-------   2 fordodone fordodone 32266213 Dec 21  2004 ./Users/laptop/Searches/Documents/Old\ Computer/My\ Music/01\ Inflate-_Barnacles.mp3
 53975 4424 -rw-------   2 fordodone fordodone  4527885 Feb 17  2007 ./Users/laptop/Searches/Documents/Old\ Computer/My\ Music/01\ Julius.mp3
 53979 12288 -rw-------   2 fordodone fordodone 12579091 Apr  1  2002 ./Users/laptop/Searches/Documents/Old\ Computer/My\ Music/01\ Mike's\ Song.mp3
 54019 8476 -rw-------   2 fordodone fordodone  8677963 Mar 31  2002 ./Users/laptop/Searches/Documents/Old\ Computer/My\ Music/01\ Vultures.mp3
 54028 6004 -rw-------   2 fordodone fordodone  6146289 Feb 17  2007 ./Users/laptop/Searches/Documents/Old\ Computer/My\ Music/01\ Wilson.mp3

Now the goal was to get a list of unique directories in which music could be found. I would then take that list and rsync those directories to a local hard drive. Since the music files could be located at any unpredictable level in the tree, and I only wanted the directory listing I did this:

# cat music_file_list | cut -d / -f 2- | rev | cut -d / -f 2- | rev | sort | uniq -c
    926 Users/laptop/Music/iTunes
     27 Users/laptop/Music/iTunes/iTunes\ Music/Podcasts/GreenBiz\ Radio
     10 Users/laptop/Music/iTunes/iTunes\ Music/Podcasts/NPR_\ Planet\ Money\ Podcast
     51 Users/laptop/Music/iTunes/iTunes\ Music/Podcasts/This\ American\ Life
      4 Users/laptop/Music/iTunes/iTunes\ Music/Podcasts/WNYC's\ Radiolab
      2 Users/laptop/Music/iTunes/iTunes\ Music/Smeal\ College\ of\ Business/Wall\ Street\ Bootcamp\ Series
      1 Users/laptop/Searches/Documents/3L\ First\ Semester/Energy
      8 Users/laptop/Searches/Documents/Old\ Computer/My\ Music

That gave me the list I was looking for and how many mp3 and m4a files were in each unique directory. I’ll probably skip the Podcasts, and just recover the rest. It looks like this will be about 30G of files, so I will probably use adrive.com to upload and share this amount of data.

TODO: revisit this exercise with awk.

get current client IP addresses from web farm

To see what common IPs are connecting to your web farm, ssh to all of the servers and get a list of clients. Then sort it until you see most busy clients.

# for i in `seq 401 436`; do ssh www$i "netstat -natp | grep EST | grep apa | grep ":80 "| awk '{print \$5}' | cut -d : -f1"; done | sort | uniq -c | sort -nk1 | tail
      3 10.0.0.1
      3 10.0.0.10
      3 10.245.34.2
      4 10.29.45.89
      5 10.111.111.111
      5 10.239.234.234
      5 10.1.1.1
      5 10.2.2.2
      6 10.3.3.3
     10 10.100.100.100
#

The list shows the number of connections, and the client IP.

find duplicate entry in sql dump

Recently, I tried to import a SQL dump created by mysqldump that somehow had a duplicate entry for a primary key. Here’s a sample of the contents:

INSERT INTO `table1` VALUES ('B97bKm',71029594,3,NULL,NULL,'2013-01-22 09:25:39'),('dZfUHQ',804776,1,NULL,NULL,'2012-09-05 16:15:23'),('hWkGsz',70198487,0,NULL,NULL,'2013-01-05 10:55:36'),('n6366s',69480146,1,NULL,NULL,'2012-
12-18 03:27:45'),('tBP6Ug',65100805,1,NULL,NULL,'2012-08-29 21:32:39'),('yfpewZ',18724906,0,NULL,NULL,'2013-03-31 17:12:58'),('UNz5qp',8392940,2,NULL,NULL,'2012-11-28 02:00:00'),('9WVpVV',71181566,0,NULL,NULL,'2013-01-25 06:15:03'),('kEP
Qu5',64972980,9,NULL,NULL,'2012-09-01 06:00:36')

It goes on for another 270,000 entries. I was able to find the duplicate value like this:

# cat /tmp/table1.sql | grep INSERT | sed -e 's/),/\n/g' | sed -e 's/VALUES /\n/' | grep -v INSERT | awk -F, '{print $2}' | sort | uniq -c | awk '{if($1>1) print;}'
    2 64590015
#

The primary key value 64590015 had 2 entries. I removed the spurious entry, and subsequently the SQL imported fine.