gracefully stop php laravel sqs worker in Docker on ECS Fargate

Using AWS SQS to process asynchronous messages is a great way to handle scheduled jobs, and work that doesn’t need to happen in real-time inside your user driven application. Containerizing a PHP Laravel app and using an orchestration service like ECS Fargate allows you to easily run thousands of job queue workers in an infinite and embarrassingly parallel fashion.

php artisan queue:work sqs

If your work queue is inconsistent in depth and rate (i.e. “bursty”) you’ll find you need to scale-out and scale-in containers based on how much work is available. Starting containers is no problem; just swipe your credit card and ECS delivers. The problem comes when needing to scale-in and stop containers after they are no longer needed, because ECS stops the php workers mid-job.

During autoscaling actions, when the ECS agent stops tasks it sends the equivalent of a docker stop to each container in the task. Underneath the covers it is sending the Unix process signal SIGTERM to the process inside the running container (PID 1). After the SIGTERM is sent, the ECS Agent waits 30 seconds for the process to exit, and if the process is still running after 30 seconds, the ECS Agent gives up and sends a SIGKILL. Sending SIGTERM (or SIGKILL) to the php process running the worker makes it immediately exit. This is expected but problematic because whatever the worker was working on is halted in the middle of what it was doing.

One solution to this problem is to wrap the php worker command inside of a bash script and use traps to catch the SIGTERM and give the worker some time to stop processing SQS messages and exit gracefully. A trap catches the signal sent to it, but it does not interrupt what the process is currently doing. The trap waits until the current process is finished, then it executes. Simply running the php worker with a trap is not enough, because the queue worker does not exit in between jobs, and php artisan queue:work sqs is a long running process. Because of this we use an infinite loop (while true; do; done;) and the –once flag to “single-run” php workers over and over. This means that for every SQS message (or empty receive) a new one-off php process is run. Doing it this way means that the trap can execute (and exit the script) in between jobs when the current job finishes processing. Something like this:

#!/bin/bash
exit_trap(){
  echo "received SIGTERM, exiting..."
  exit 0
}

trap exit_trap SIGTERM

while true
do
  php artisan queue:work sqs --once
done

caveat emptor

  • Running php workers with –once means the entire framework has to bootstrap for every message, which may add some extra processing time. But honestly, if your framework takes a long time to load you have bigger problems.
  • Running a new php process for every message can solve the “leaky” nature of php or laravel or problematic code and long running php worker processes slowly consuming more and more memory.
  • Workers must go from listening, to working, to finished with current job within 30 seconds (or less depending on the default receive message wait time). If the worker takes longer than 30 seconds it will receive a SIGKILL mid-job and die.
  • Using the EC2 Launch Type instead of Fargate will allow you to tweak the docker stop grace period. This value is not configurable with the Fargate Launch Type and you are stuck with 30 seconds.
  • Bash is used here, and PID 1 becomes a bash script instead of a php command.
  • It’s even more important for jobs to be idempotent, and have the ability to be re-run trivially at any time.
share:

percentile apache server request response times

I needed a hack to quickly find the 95th percentile of apache request response times. For example I needed to be able to say that “95% of our apache requests are served in X milliseconds or less.” In the apache2 config the LogFormat directive had %D (the time taken to serve the request, in microseconds) as the last field. Meaning the last field of each log line would be the time it took to serve the request. This would make it easy to pull out with $NF in awk

# PCT=.95; NR=`cat access.log | wc -l `; cat /var/log/apache2/access.log | awk '{print $NF}' | sort -rn | tail -n+$(echo "$NR-($NR*$PCT)" |bc | cut -d. -f1) |head -1
938247

In this case 95% of the apache requests were served in 938 milliseconds or less (WTF?!). Then run on an aggregated group of logs, or change the date/time range to just run for logs on a particular day, or for multiple time periods.

Note: I couldn’t get scale to work here in bc for some reason.

share:

mercurial hg clone turn off host key checking for bitbucket.org

If you clone a repository during an automated code deploy (for example in AWS CodeDeploy or Atlassian Bamboo) then you probably need to turn off host key checking for the clone of your repository. This prevents hg (or git) from raising a user prompt about the authenticity of the host key.

$ echo -e "Host bitbucket.org\nStrictHostKeyChecking no\n" >> ~/.ssh/config
share:

count new connections per minute to a tcp port

I was running a custom FTP service out of inetd, when it intermittently stopped responding to requests (Connection refused.) In the logs inetd was logging:

Mar 23 06:54:36 fordodone inetd[1510]: ftp/tcp server failing (looping), service terminated for 10 min

After some searching I discovered this error happens when there are too many connections to an inetd service per minute. How many is too many? From the man page for inetd.conf we can see that the default is 256. So the aggregate number of opening connections was over 256 per minute and inetd stops responding for 10 minutes to protect itself and the system from running out of resources. I increased the default to 512 (debian system) and restarted inetd for now.

# echo 'OPTIONS="-R"' >> /etc/default/openbsd-inetd  && service openbsd-inetd restart

How close am I to the 256 default? How often would it happen? Is there a pattern? Could this be legit traffic or a DoS attack? I wrote this one liner to see new or opening connections to the ftp control port per minute. You could change it a little for other services.

# tcpdump -lni eth1 "tcp[13] & 2 != 0" and dst port 21 2>/dev/null | while read i ; do j=`echo $i | cut -d : -f -2`; if [ "$k" == "$j" ]; then l=$(($l+1)); else echo "$k -- $l"; k=$j; l=1; fi; done;

Start with tcpdump on the interface you want to listen(-i eth1), no need to resolve hostnames(-n), or buffer output(-l), and look at the TCP flags byte (tcp[13]) (13th byte) for the SYN bit (2) to see if it is set, and only if the destination port is 21. Pipe it to a while loop and read in the lines as they come. Note the hour:minute, and count packets for that minute. If the minute changes, output the last minutes count, and reset the counter.

You have to ignore the first 2 lines. The first one means nothing, and the second one is missing the portion of the minute that was before you started it. The real results start to roll in on iteration 3.

 --
17:26 -- 6
17:27 -- 21
17:28 -- 20
17:29 -- 34
17:30 -- 38
17:31 -- 27
17:32 -- 37
17:33 -- 22
17:34 -- 23
17:35 -- 33
17:36 -- 29
17:37 -- 23
17:38 -- 28
17:39 -- 26
17:40 -- 73
17:41 -- 99
17:42 -- 132
17:43 -- 110
17:44 -- 130
17:45 -- 112
17:46 -- 109
17:47 -- 104
17:48 -- 182
17:49 -- 155
17:50 -- 145
17:51 -- 110
17:52 -- 154
17:53 -- 147
17:54 -- 86
17:55 -- 39
17:56 -- 39
17:57 -- 30
17:58 -- 30
17:59 -- 38
share: