gracefully stop php laravel sqs worker in Docker on ECS Fargate

Using AWS SQS to process asynchronous messages is a great way to handle scheduled jobs and work that doesn’t need to happen in real time inside your user-driven application. Containerizing a PHP Laravel app and running it on an orchestration service like ECS Fargate lets you easily run thousands of job queue workers in an effectively unlimited, embarrassingly parallel fashion.

php artisan queue:work sqs

If your work queue is inconsistent in depth and rate (i.e. “bursty”), you’ll find you need to scale out and scale in containers based on how much work is available. Starting containers is no problem; just swipe your credit card and ECS delivers. The problem comes when you need to scale in and stop containers that are no longer needed, because ECS stops the php workers mid-job.

During autoscaling actions, when the ECS agent stops tasks it sends the equivalent of a docker stop to each container in the task. Under the covers it sends the Unix process signal SIGTERM to the process running inside the container (PID 1). After sending SIGTERM, the ECS agent waits 30 seconds for the process to exit; if the process is still running after 30 seconds, the agent gives up and sends a SIGKILL. Sending SIGTERM (or SIGKILL) to the php process running the worker makes it exit immediately. This is expected but problematic, because whatever the worker was working on is halted in the middle of what it was doing.
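
For context, this is what a typical worker container looks like when the php command runs directly as PID 1. This is only a sketch; the base image tag and paths are illustrative, not taken from any particular setup:

FROM php:8.2-cli
WORKDIR /var/www/app
COPY . .
# Exec-form CMD: the php process itself is PID 1, so it receives the
# SIGTERM from ECS (docker stop) directly and exits mid-job.
CMD ["php", "artisan", "queue:work", "sqs"]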

One solution to this problem is to wrap the php worker command inside a bash script and use a trap to catch the SIGTERM, giving the worker time to stop processing SQS messages and exit gracefully. A trap catches the signal sent to the script, but it does not interrupt the foreground command that is currently running; bash waits until that command finishes, then executes the trap. Simply running the php worker with a trap is not enough, because the queue worker does not exit in between jobs: php artisan queue:work sqs is a long-running process. Instead, we use an infinite loop (while true; do …; done) and the --once flag to “single-run” php workers over and over. This means that for every SQS message (or empty receive) a new one-off php process is run. Doing it this way means the trap can execute (and exit the script) in between jobs, once the current job finishes processing. Something like this:

#!/bin/bash

# Runs when SIGTERM arrives; bash defers trap execution until the
# current foreground command finishes, so in-flight jobs are never cut off.
exit_trap(){
  echo "received SIGTERM, exiting..."
  exit 0
}

trap exit_trap SIGTERM

# Run the worker one job at a time so there is a gap between jobs
# where the deferred trap can fire and the script can exit cleanly.
while true
do
  php artisan queue:work sqs --once
done
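
For the trap to matter, bash has to be the process that receives the SIGTERM. Assuming the script above is saved as worker.sh (the name and path are just examples), the Dockerfile wiring might look like this:

COPY worker.sh /usr/local/bin/worker.sh
RUN chmod +x /usr/local/bin/worker.sh
# Exec form keeps the bash script as PID 1 so its SIGTERM trap actually fires;
# shell form (ENTRYPOINT /usr/local/bin/worker.sh) would wrap it in /bin/sh -c,
# which does not forward signals to the script.
ENTRYPOINT ["/usr/local/bin/worker.sh"]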

caveat emptor

  • Running php workers with --once means the entire framework has to bootstrap for every message, which may add some extra processing time. But honestly, if your framework takes a long time to load you have bigger problems.
  • Running a new php process for every message also sidesteps memory leaks, whether from php, Laravel, or problematic application code, that cause long-running worker processes to slowly consume more and more memory.
  • Workers must go from listening, to working, to finished with the current job within 30 seconds of the SIGTERM; the real budget is even less, since the receive message wait time counts against it too. If the worker takes longer, it receives a SIGKILL mid-job and dies (see the sketch after this list).
  • Using the EC2 Launch Type instead of Fargate will allow you to tweak the docker stop grace period. This value is not configurable with the Fargate Launch Type and you are stuck with 30 seconds.
  • Bash is used here, and PID 1 becomes a bash script instead of a php command.
  • It’s even more important for jobs to be idempotent and trivially re-runnable at any time.
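
On the timing caveat above: Laravel’s queue:work supports a --timeout flag that caps how long a single job may run, which bounds how long the loop can hold off the deferred trap. The value below is illustrative; the real budget is your SQS receive wait time plus the longest job you expect, and it has to stay under the 30-second grace period:

# cap each job so a single --once pass (receive wait + job time)
# comfortably finishes before the 30-second SIGKILL deadline
php artisan queue:work sqs --once --timeout=20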