Using find to act on files is very useful, but it gets a bit trickier when the files that are found need different actions based on their file type. For example, some log files are named foo.log, but after 10 days they get compressed to foo.log.gz, so you are finding regular text files as well as gzipped text files. Extend your find with an -exec and a bash shell that checks the file extension and runs grep or zgrep accordingly. Then run the output through awk or whatever else to parse out what you need.

# find . -type f -name 'foo.log*' -exec bash -c 'if [[ $0 =~ \.log$ ]]; then grep foobar "$0"; elif [[ $0 =~ \.log\.gz$ ]]; then zgrep foobar "$0"; fi' {} \; | awk '{if(/typea/)a++; if(/typeb/)b++; tot++} END {print "typea: "a" - "a*100/tot"%"; print "typeb: "b" - "b*100/tot"%"; print "typec: "tot-(a+b)" - "(tot-(a+b))*100/tot"%"; print "total: "tot;}'
typea: 5301 - 67.4771%
typeb: 2539 - 32.3192%
typec: 16 - 0.203666%
total: 7856
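
The same dispatch can also be written with a case statement and find's {} + form, which starts one bash for a whole batch of files instead of one per file. This is only a sketch, assuming bash, gzip and zgrep are installed; the sample logs and the temporary directory are made up for the demo:

```shell
# Sketch: one bash per batch of files ("+"), with case choosing grep or zgrep.
# The sample logs and the pattern "foobar" are invented for this demo.
workdir=$(mktemp -d)
printf 'foobar typea\nunrelated line\n' > "$workdir/foo.log"
printf 'foobar typeb\n' | gzip > "$workdir/foo.log.gz"

# "_" fills $0 of the inner shell, so the file names arrive as "$@".
matches=$(find "$workdir" -type f -name 'foo.log*' -exec bash -c '
  for f in "$@"; do
    case $f in
      *.log)    grep  foobar "$f" ;;
      *.log.gz) zgrep foobar "$f" ;;
    esac
  done' _ {} + | sort)

printf '%s\n' "$matches"
rm -rf "$workdir"
```

The output can be piped into the same awk summary as before.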

find and search for string in gzipped and text logs

Find logs modified in the last 3 weeks. If they are gzipped use zgrep, if they are a regular text log use grep, and if they aren’t a log do nothing. Then search for the string in the found log file:

# find /mnt/toaster1/logs/app_logs/application1/2014 -type f -mtime -21 -exec bash -c 'if [[ $0 == *.log ]]; then g=grep; elif [[ $0 == *.gz ]]; then g=zgrep; else g=:; fi; $g "foostring" "$0"' {} \;

find directories owned by root

Find the directories owned by root in a certain part of the tree:

# find . -depth -mindepth 1 -maxdepth 3 -type d -ls | awk '$5 ~ /root/ {print}'
  7930    0 drwxr-xr-x  12 root root      115 Oct 11 16:44 ./562
3805069    0 drwxr-xr-x   3 root root       20 Oct 11 16:44 ./562/8562
  7946    0 drwxr-xr-x   5 root root       46 Dec  8 23:52 ./563/6563
  7947    0 drwxr-xr-x   3 root root      21 Oct 21  2008 ./563/6563/456563
3464735    0 drwxr-xr-x   2 root root        6 Sep 26 17:29 ./563/6563/436563
4075144    4 drwxr-xr-x   2 root root     4096 Dec  9 00:39 ./563/6563/2366563

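Change the ownership of every directory at those levels to www-data (add -user root to the find if only the root-owned ones should be touched):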

# find . -depth -mindepth 1 -maxdepth 3 -type d -exec chown www-data: {} \;

You could do this:

# cd .. && chown -R www-data: dirname

But we only suspect the problem at a certain level in the tree, and it would be way slow to recursively chown hundreds of millions of files.
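
If only the root-owned directories should change, find can match on ownership itself with -user, and putting echo in front of chown gives a preview first. A rough sketch: the directory layout is hypothetical, and the current user stands in for root so that it runs without privileges:

```shell
# Sketch: match on ownership inside find and preview the chown first.
# Hypothetical layout; the current uid stands in for root.
owner=$(id -u)
top=$(mktemp -d)
mkdir -p "$top/562/8562" "$top/563/6563"

# "echo" turns the chown into a dry run; remove it to apply the change.
preview=$(cd "$top" && find . -depth -mindepth 1 -maxdepth 3 -type d \
  -user "$owner" -exec echo chown www-data: {} \;)

printf '%s\n' "$preview"
rm -rf "$top"
```

On the real tree the test would be -user root, and the echo would be dropped once the preview looks right.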


sort nested directories by last modified using find

Using ls -lt to sort a file listing by last modified time is simple and easy. If you have a large directory tree with tens of thousands of directories, using find with some massaging might be the way to go. In this example there is a directory containing many directories nested three levels deep, with paths like ./586/1586/1311586.


We are interested in the 3rd level directories and want a list of which ones were most recently modified:

# find . -mindepth 3 -maxdepth 3 -ls | awk '$10 !~ /^20[01]/' | sed -e 's/:/ /' | sort -k8,8M -nk9,9n -nk10 -nk11 | awk '{print $12" "$8" "$9" "$10":"$11}'| column -t | tail -10

We start by finding only 3rd level directories with extended listings (there are no files at this level, so -type d is unnecessary). Then use awk to print only directories that have been modified this year, i.e. drop anything that shows a year like 200* or 201* instead of an hour:minute in column 10. Replace the colon in HH:MM with a space so that we can sort by hour and then by minute. Then rearrange the columns, add back the hour:minute colon, run it through column to get nice columns, and take the last 10 results.

./586/1586/1311586  Sep  16  16:11
./980/6980/2326980  Sep  16  16:18
./616/3616/513616   Sep  16  16:20
./133/9133/2119133  Sep  16  16:21
./422/6422/2106422  Sep  16  16:24
./566/6566/2326566  Sep  16  16:46
./672/672/2310672   Sep  16  16:51
./680/680/2290680   Sep  16  17:42
./573/5573/2325573  Sep  16  17:47
./106/1106/2321106  Sep  16  17:49
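
If GNU find is available, much of the massaging can be skipped by having find print a sortable timestamp itself. This is a sketch under that assumption (-printf is a GNU extension, and touch -d here is GNU touch). The directory names are borrowed from the output above, and the year is invented:

```shell
# Sketch, assuming GNU find and GNU touch; the timestamps are set by hand.
top=$(mktemp -d)
mkdir -p "$top/586/1586/1311586" "$top/980/6980/2326980" "$top/106/1106/2321106"
touch -d '2014-09-16 16:11' "$top/586/1586/1311586"
touch -d '2014-09-16 16:18' "$top/980/6980/2326980"
touch -d '2014-09-16 17:49' "$top/106/1106/2321106"

# %T@ is the mtime in epoch seconds, so a numeric sort orders by time
# even across year boundaries; cut then drops that sort key again.
recent=$(cd "$top" && find . -mindepth 3 -maxdepth 3 \
  -printf '%T@ %p %Tb %Td %TH:%TM\n' | sort -n | cut -d' ' -f2- | tail -10)

printf '%s\n' "$recent" | column -t
rm -rf "$top"
```

Because the sort key is a plain number, there is no need to filter out older years or split the HH:MM field.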

find music directories

I was recently handed an old Windows laptop, and told “It is broken so I know you can put it to use, and if you get my music off of it that would be awesome.” Right away I knew I had a great chance of recovering everything from the hard drive.

I took the hard drive out of the laptop and plugged it into my workstation via a SATA to USB converter. It showed right up and I mounted the partition that I thought would be the Windows partition:

# ls
autoexec.bat  config.sys  doctemp                 found.000  found.003     MSOCache     pagefile.sys  ProgramData       Program Files              Users
Boot          DELL        Documents and Settings  found.001  hiberfil.sys  newfile.enc  pending.un    ProgramData.LOG1  $Recycle.Bin               Windows
bootmgr       dell.sdr    Drivers                 found.002  Intel         newkey       PerfLogs      ProgramData.LOG2  System Volume Information

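Well, that looks familiar. Then I went into the person’s user directory and did this (the escaped parentheses group the two -name tests, so that -type f and -ls apply to both of them):

# find . -type f \( -name '*.m4a' -o -name '*.mp3' \) -ls > ~/music_file_list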

I could have been more thorough and looked for more file extensions (aac, m4u, etc.), but I figured iTunes would just put every music file in the same folder. The resultant file looked like this:

# tail music_file_list
 82837 9632 -rw-------   2 fordodone fordodone  9861791 Sep 16  2008 ./Users/laptop/Music/iTunes/Alanis\ Morissette\ -\ Jagged\ Little\ Pi\ 12.mp3
   307 27696 -rw-------   2 fordodone fordodone 28357414 Oct 22  2008 ./Users/laptop/Searches/Documents/3L\ First\ Semester/Energy/dem\ now.mp3
 53814 4856 -rw-------   2 fordodone fordodone  4972361 Feb 17  2007 ./Users/laptop/Searches/Documents/Old\ Computer/My\ Music/01\ Bouncing\ Around\ The\ Room.mp3
 53817 6116 -rw-------   2 fordodone fordodone  6259086 Feb 17  2007 ./Users/laptop/Searches/Documents/Old\ Computer/My\ Music/01\ Come\ Together.mp3
 53834 8132 -rw-------   2 fordodone fordodone  8325962 Feb 17  2007 ./Users/laptop/Searches/Documents/Old\ Computer/My\ Music/01\ Funky\ Bitch.mp3
 53962 31512 -rw-------   2 fordodone fordodone 32266213 Dec 21  2004 ./Users/laptop/Searches/Documents/Old\ Computer/My\ Music/01\ Inflate-_Barnacles.mp3
 53975 4424 -rw-------   2 fordodone fordodone  4527885 Feb 17  2007 ./Users/laptop/Searches/Documents/Old\ Computer/My\ Music/01\ Julius.mp3
 53979 12288 -rw-------   2 fordodone fordodone 12579091 Apr  1  2002 ./Users/laptop/Searches/Documents/Old\ Computer/My\ Music/01\ Mike's\ Song.mp3
 54019 8476 -rw-------   2 fordodone fordodone  8677963 Mar 31  2002 ./Users/laptop/Searches/Documents/Old\ Computer/My\ Music/01\ Vultures.mp3
 54028 6004 -rw-------   2 fordodone fordodone  6146289 Feb 17  2007 ./Users/laptop/Searches/Documents/Old\ Computer/My\ Music/01\ Wilson.mp3

Now the goal was to get a list of unique directories in which music could be found. I would then take that list and rsync those directories to a local hard drive. Since the music files could be located at any unpredictable level in the tree, and I only wanted the directory listing I did this:

# cat music_file_list | cut -d / -f 2- | rev | cut -d / -f 2- | rev | sort | uniq -c
    926 Users/laptop/Music/iTunes
     27 Users/laptop/Music/iTunes/iTunes\ Music/Podcasts/GreenBiz\ Radio
     10 Users/laptop/Music/iTunes/iTunes\ Music/Podcasts/NPR_\ Planet\ Money\ Podcast
     51 Users/laptop/Music/iTunes/iTunes\ Music/Podcasts/This\ American\ Life
      4 Users/laptop/Music/iTunes/iTunes\ Music/Podcasts/WNYC's\ Radiolab
      2 Users/laptop/Music/iTunes/iTunes\ Music/Smeal\ College\ of\ Business/Wall\ Street\ Bootcamp\ Series
      1 Users/laptop/Searches/Documents/3L\ First\ Semester/Energy
      8 Users/laptop/Searches/Documents/Old\ Computer/My\ Music

That gave me the list I was looking for and how many mp3 and m4a files were in each unique directory. I’ll probably skip the Podcasts, and just recover the rest. It looks like this will be about 30G of files, so I will probably use to upload and share this amount of data.

TODO: revisit this exercise with awk.
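
One possible awk version, as a sketch rather than the final word: strip everything up to the first slash, strip the file name, and count what is left. The three sample lines are copied from the listing above, and the temporary file stands in for music_file_list:

```shell
# Sketch: count files per directory from "find -ls" output using awk only.
list=$(mktemp)
cat > "$list" <<'EOF'
 82837 9632 -rw-------   2 fordodone fordodone  9861791 Sep 16  2008 ./Users/laptop/Music/iTunes/Alanis\ Morissette\ -\ Jagged\ Little\ Pi\ 12.mp3
 53814 4856 -rw-------   2 fordodone fordodone  4972361 Feb 17  2007 ./Users/laptop/Searches/Documents/Old\ Computer/My\ Music/01\ Bouncing\ Around\ The\ Room.mp3
 53817 6116 -rw-------   2 fordodone fordodone  6259086 Feb 17  2007 ./Users/laptop/Searches/Documents/Old\ Computer/My\ Music/01\ Come\ Together.mp3
EOF

dirs=$(awk '{
  sub("^[^/]*/", "")     # drop the -ls columns and the leading "./"
  sub("/[^/]*$", "")     # drop the file name, keeping its directory
  count[$0]++
}
END { for (d in count) printf "%7d %s\n", count[d], d }' "$list" | sort -k2)

printf '%s\n' "$dirs"
rm -f "$list"
```

This replaces the cut, rev, and uniq -c steps with a single awk call. The trailing sort is only there because awk does not guarantee the order of the for-in loop.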


find files modified today

find is an amazing command. With the proper manipulation it can be used to massage out the file data you need.

Finding files modified in the last 24 hours is straightforward. Just look for files with a modified time newer than now minus 1 day (24 hours):

find . -type f -mtime -1 -ls 

But if you just want files modified today it’s a bit more involved:

touch -t `date +%m%d0000` /tmp/$$
find . -type f -newer /tmp/$$ -ls
rm /tmp/$$

Touch a file with the timestamp of 12am (midnight) today. The file can have any name, but we just use the bash pid here. Then find files newer than that file. It will return files modified some time since 12am. Then remove the touched file.
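
Where find supports -newermt (GNU find and BSD find both do), the reference file can be skipped and the comparison made against a date string. A minimal sketch under that assumption, with two invented sample files:

```shell
# Sketch, assuming a find that supports -newermt.
top=$(mktemp -d)
touch "$top/today.txt"
touch -t 200811261200 "$top/old.txt"     # Nov 26 2008, well before today

# Compare mtimes against midnight today directly; no reference file needed.
today=$(find "$top" -type f -newermt "$(date +%Y-%m-%d) 00:00:00" \
  -exec basename {} \;)

printf '%s\n' "$today"
rm -rf "$top"
```

This also avoids leaving a stray file in /tmp if the script is interrupted between the touch and the rm.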


delete files with unrecognized characters

For whatever reason you may find some files with unrecognized or mis-encoded characters that need to be removed. Because the terminal doesn’t recognize the characters, it’s difficult to type the file names or do anything else with them.

# ls -l
-rw-r--r-- 1 www-data www-data 14828193 Nov 26  2008 ?¡?ú?©?ç?}?¤?I?@áÁ?????�?ï?i?r?g?j?j [51] 2008.10.02 ?w?¼?e?Ɋm?F?µ?Ă݂܂·?I?x - HirataTalk +AB Quiz.wma
-rw-r--r-- 1 www-data www-data 14568695 Nov 26  2008 ?V?g?ƈ«???̎·???J?t?F?ւ悤?±?» [01] 2007.08.31 - ?ò?é?݂䂫.wma
-rw-r--r-- 1 www-data www-data 11898139 Nov 26  2008 ?V?g?ƈ«???̎·???J?t?F?ւ悤?±?» [02] 2007.09.07 - kukui.wma
-rw-r--r-- 1 www-data www-data 11642799 Nov 26  2008 ?V?g?ƈ«???̎·???J?t?F?ւ悤?±?» [03] 2007.09.14 - ?ێu???ê?N.wma

Use the -i flag with ls to obtain the inode number of the files:

# ls -li
6886578 -rw-r--r-- 1 www-data www-data 14828193 Nov 26  2008 ?¡?ú?©?ç?}?¤?I?@áÁ?????�?ï?i?r?g?j?j [51] 2008.10.02 ?w?¼?e?Ɋm?F?µ?Ă݂܂·?I?x - HirataTalk +AB Quiz.wma
6886580 -rw-r--r-- 1 www-data www-data 14568695 Nov 26  2008 ?V?g?ƈ«???̎·???J?t?F?ւ悤?±?» [01] 2007.08.31 - ?ò?é?݂䂫.wma
6886581 -rw-r--r-- 1 www-data www-data 11898139 Nov 26  2008 ?V?g?ƈ«???̎·???J?t?F?ւ悤?±?» [02] 2007.09.07 - kukui.wma
6886582 -rw-r--r-- 1 www-data www-data 11642799 Nov 26  2008 ?V?g?ƈ«???̎·???J?t?F?ւ悤?±?» [03] 2007.09.14 - ?ێu???ê?N.wma

Now use find with the -inum flag to find only the file with a specific inode number. Then delete it:

# find . -inum 6886578 -delete
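
Inode numbers are only unique within one filesystem, so it may be worth keeping find in the current directory and on one filesystem before deleting. A small sketch of that, with an invented file whose name contains a control character, assuming a find with -maxdepth and -delete (GNU and BSD find):

```shell
# Sketch: delete a file with an awkward name by inode, scoped tightly.
top=$(mktemp -d)
touch "$top/$(printf 'bad\001name.wma')" "$top/keep.wma"

# Pick up the inode number from ls -i (first column).
inode=$(ls -i "$top" | awk '/name\.wma$/ {print $1}')

# -maxdepth 1 stays in this directory, -xdev stays on this filesystem.
# Swap -delete for -print to preview what would be removed.
find "$top" -maxdepth 1 -xdev -inum "$inode" -delete

remaining=$(ls "$top")
printf '%s\n' "$remaining"
rm -rf "$top"
```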

find exec with grep pipe

If you have to search 62,000 log files for a specific string what’s the best way to do it? This will not work:

# zgrep string www1*/apache2/*/*/*/*error*.log.gz

Because the shell expands the globs before zgrep ever runs, the resulting argument list is too long and the command fails with an “Argument list too long” error.

Instead use find to build the list of log files. You could redirect the list to a file and then run a for loop over it, but we can just use -exec with find to run commands on the log files as we find them. This is nice, because you can process the files and see output as it chugs along. Either of these works:

# find www1*/apache2/*/*/*/ -name '*error*.log.gz' -exec zgrep string {} \;

# find www1*/apache2/*/*/*/ -name '*error*.log.gz' -exec sh -c 'zgrep string "$0"' {} \;

In my head it sounds something like this: “find the files in the matching directories, that are named like ‘*error*.log.gz’, and as you find them, execute a command on them. The command is a new shell command to zgrep for the string in the file you just found.”

The first one works fine, BUT if you need to pipe your zgrep or whatever to some other command you need to execute a sub shell for that.

## do sed substitution after
-exec sh -c 'zgrep string "$0" | sed -e "s/A/B/g"' {} \;

## read backwards and find first (aka last) occurrence
-exec sh -c 'zcat "$0" | tac | grep -m1 string' {} \;

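Always use single quotes around the command given to sh -c, because you don’t want the current shell to interpret it. The $0 must be passed through as a literal so that the subshell can interpret it. A single-quoted string cannot contain another single quote, not even an escaped one, so use double quotes for anything that needs quoting inside the command. The $0 in the subshell refers to the FIRST argument after the command string, which in this case is {}, or the file that find has currently found.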
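
To see how sh -c hands out its arguments, here is a small sketch. The second half uses a common convention that is not in the commands above: a placeholder word fills $0 and the file arrives as $1. The sample log file is invented, and gzip and zgrep are assumed to be installed:

```shell
# Sketch: how sh -c maps trailing arguments to $0, $1, ...
mapping=$(sh -c 'echo "0=$0 1=$1"' first second)
printf '%s\n' "$mapping"

# Placeholder "sh" becomes $0, the found file becomes $1.
# The file name contains a space to show why "$1" is quoted.
top=$(mktemp -d)
printf 'A string A\n' | gzip > "$top/my error.log.gz"

piped=$(find "$top" -name '*error*.log.gz' \
  -exec sh -c 'zgrep string "$1" | sed -e "s/A/B/g"' sh {} \;)

printf '%s\n' "$piped"
rm -rf "$top"
```

Either style works. The placeholder form just keeps $0 free for the shell's name in any error messages.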


remove many empty directories

# find . -depth -mindepth 1 -maxdepth 3 -type d -exec rmdir {} \;

This finds directories, between 1 and 3 levels deep and attempts to remove them. The -depth flag finds the deepest child directories, before finding parents. This is great, because it tries to remove foo/bar/ before it will try to remove foo/. Without removing foo/bar/ first, rmdir foo/ would fail. Because rmdir will fail if there are any contents in a directory, the operation is safe to run without removing any files. You could redirect STDERR to a file, and capture all the directories that are not empty for processing later.
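
A sketch of the STDERR idea, using a throwaway tree so nothing real is touched. The directory names and the error log are invented for the demo, and the exact wording of rmdir's error message depends on the system:

```shell
# Sketch: remove empty directories and log the ones rmdir refuses to remove.
top=$(mktemp -d)
errlog=$(mktemp)
mkdir -p "$top/foo/bar" "$top/full"
touch "$top/full/file.txt"            # makes "full" non-empty

# rmdir complaints about non-empty directories land in $errlog.
(cd "$top" && find . -depth -mindepth 1 -maxdepth 3 -type d \
  -exec rmdir {} \;) 2> "$errlog"

survivors=$(cd "$top" && find . -mindepth 1 -type d)
errors=$(cat "$errlog")

printf 'still there: %s\n' "$survivors"
printf 'logged: %s\n' "$errors"
rm -rf "$top" "$errlog"
```

Here foo/bar and then foo are removed, while full survives and shows up in the log for later processing.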