This post was originally published on the BetterDoc Dev-Blog.
This post tells the story of some missing data and how it turned out to be related to how we built our Docker images at work.
To give you the proper context, it will cover:
- the problem
- the root cause
- how we fixed it
- what we learned from this
Grab a coffee and make yourself comfortable, we’re going iiiiiiiiin!
The Mystery of the Missing Data
Recently a colleague reported some inconsistent data in two distinct parts of our system. Some data was missing that wasn’t supposed to be missing.
Without going into too much detail, this data is supposed to be synchronized through an eventing mechanism. One of our services emits events which another service picks up.
After digging a bit through the logs of both services I concluded that the receiving service never got the events in question, which means that the emitting service was misbehaving.
Where are my events!?
The emitting service is written in Ruby and emits the event in a Sidekiq job.
After some more log-digging it turned out that these jobs were running when the worker was shut down by Kubernetes. It seems like shutting down the service’s worker while it’s performing jobs leads to data loss?
What are you doing, Sidekiq!?
Surely that can’t be it, right? Surely a well-maintained background processing library such as Sidekiq doesn’t lose jobs on shutdown?
Well, yes and no. The Deployment page in the Sidekiq wiki explains it quite well:
> To safely shut down Sidekiq, you need to send it the TSTP signal as early as possible in your deploy process and the TERM signal as late as possible. TSTP tells Sidekiq to stop pulling new work and finish all current work. TERM tells Sidekiq to exit within N seconds, where N is set by the -t timeout option and defaults to 25. Using TSTP+TERM in your deploy process gives your jobs the maximum amount of time to finish before exiting.
>
> If any jobs are still running when the timeout is up, Sidekiq will push those jobs back to Redis so they can be rerun later.
>
> Your deploy scripts must give Sidekiq N+5 seconds to shutdown cleanly after the TERM signal. For example, if you send TERM and then send KILL 10 seconds later, you will lose jobs (if using Sidekiq) or duplicate jobs (if using Sidekiq Pro’s super_fetch).
The last sentence is especially interesting:

> For example, if you send TERM and then send KILL 10 seconds later, you will lose jobs (if using Sidekiq) or duplicate jobs (if using Sidekiq Pro’s super_fetch).
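To make that concrete, a safe shutdown sequence looks roughly like this (just a sketch; `SIDEKIQ_PID` and the exact timings are placeholders):

```bash
kill -TSTP "$SIDEKIQ_PID"   # quiet the worker: stop pulling new jobs, finish current ones
# ... roll out the rest of the deploy ...
kill -TERM "$SIDEKIQ_PID"   # ask Sidekiq to exit within its -t timeout (default: 25s)
sleep 30                    # allow N+5 seconds before anything is allowed to send KILL
```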
So it seems like Sidekiq gets killed by a `SIGKILL`, but why?
Docker, the `ENTRYPOINT`, and PID1
Let me skip a bit ahead.
Our Docker image for the service in question contains a `docker-entrypoint.sh` bash script which - as the name suggests - is the `ENTRYPOINT` of the image.
A (very simplified) version of this script looks like this:
```bash
#!/bin/bash -eu

for cmd in "$@"; do
  case "$cmd" in
    bash)    bash -i;;
    console) bin/rails console;;
    migrate) bin/rails db:migrate;;
    web)     bin/rails server -p "${PORT:-3000}";;
    worker)  bundle exec sidekiq -q default -q mailers;;
  esac
done
```
The `for` allows us to pass multiple commands, for example `docker-entrypoint.sh migrate web`, which would first run the migrations, and only then start Rails.
It’s super handy, and will become relevant at a later point, so I’ve decided to leave the `for` in there.
Alright but what does this mean in practice?
What’s actually going on in the container?
Let’s take a peek at which processes are running when we invoke `docker run my-container worker`:
```
PID ... COMMAND
  1 ... /bin/bash -e ./docker-entrypoint.sh worker
  8 ... sidekiq 5.0.5 app [0 of 5 busy]
```
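(If you want to peek yourself: assuming `ps` is available inside the image, something along these lines does the trick; the container name is made up.)

```bash
docker exec my-running-container ps -eo pid,command
```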
Interesting.
We have `bash` running as PID1 and `sidekiq` as PID8.
Nothing out of the ordinary, right?
Well, not quite. It turns out that PID1 has great power, and with great power comes great responsibility.
Linux and PID1
In Unix-based operating systems, PID1 gets some special love:
- PID1 is expected to reap zombie processes (which is super metal)
- PID1 doesn’t get the default signal handling, which means it won’t terminate on `SIGTERM` or `SIGINT` unless it explicitly registers handlers to do so (not so metal; see the sketch right after this list)
- When PID1 dies, all remaining processes are killed with `SIGKILL`, which cannot be trapped (very unmetal)
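You can observe the second point with a throwaway container (a sketch, assuming Docker is available locally; image and container name are arbitrary):

```bash
# sleep ends up as PID1 and installs no SIGTERM handler, so it ignores TERM
docker run --rm --name pid1-demo debian:bullseye sleep 300

# in another shell: docker stop sends TERM, waits 10 seconds, then falls back to KILL
docker stop pid1-demo   # takes the full 10 seconds because TERM is ignored
```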
Alright, so what implications does that have when `bash` is PID1, as in our case?
- `bash` actually does zombie reaping!
- `bash` also registers signal handlers! Even when it’s PID1, it reacts as expected to `SIGTERM` by shutting down.
- It turns out, while `bash` does handle `SIGTERM`, it does not wait for its children to exit, which means those children get brutally murdered by `SIGKILL`.
And now it makes sense that `sidekiq` doesn’t get a chance to shut down gracefully.
So what can we do about it?
While we actually could teach `bash` to forward signals to its children, doing so is fairly brittle and a bit messy.
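For the curious, a manual forwarding setup would look roughly like this (a sketch, not what we ended up using):

```bash
#!/bin/bash
# Sketch: keep bash as PID1, but forward TERM/INT to the child and wait for it.
# Workable, but every signal and edge case has to be handled by hand.

bundle exec sidekiq -q default -q mailers &
child=$!

forward() {
  kill -TERM "$child" 2>/dev/null
  wait "$child"
}
trap forward TERM INT

wait "$child"
```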
Instead, let’s try to remove `bash` from the equation.
Bashing bash
How do we remove `bash` from the equation, you ask?
`exec` is our friend here!
When using `exec`, it replaces the `bash` process with whatever process the given command spawns.
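You can see this replacement happen with a tiny throwaway experiment (just a sketch; `ps` stands in for “whatever process the given command spawns”):

```bash
# bash prints its own PID, then replaces itself with ps via exec.
# ps reports the very same PID, proving bash is gone and ps took its place.
bash -c 'echo "my pid is $$"; exec ps -o pid,command -p $$'
```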
With this in mind, let’s update our `docker-entrypoint.sh` script!
```bash
#!/bin/bash -eu

for cmd in "$@"; do
  case "$cmd" in
    bash)    exec bash -i;;
    console) exec bin/rails console;;
    migrate) bin/rails db:migrate;;
    web)     exec bin/rails server -p "${PORT:-3000}";;
    worker)  exec bundle exec sidekiq -q default -q mailers;;
  esac
done
```
But wait, we didn’t `exec` the `migrate` command?
Yes, and for good reason.
Remember: `exec` does replace the `bash` process.
And since `bash` is no more, it also won’t continue to execute our script.
If we’d `exec`ed every single command, it would basically defeat the purpose of our `for`-loop, since something like `docker-entrypoint.sh migrate web` would become impossible.
Assuming you use a script like the one above, a good rule of thumb is to only `exec` “long running” commands.
That is, commands which you’d arguably put at the end of the “command chain”, such as `web` or `worker`.
Alright, but does this solve our immediate issue?
Does this ensure that `sidekiq` shuts down gracefully?
Actually, yes!
But it also means that `sidekiq` has now become PID1, which - as you might remember - comes with great responsibilities.
That’s Not My Responsibility
As you might remember from earlier, there are a few things which are special about PID1:
- it needs to reap zombies
- it needs to explicitly register signal handlers (e.g. for `SIGTERM`)
- when it dies, everything else will be killed with `SIGKILL`
And while `sidekiq` actually does register signal handlers, it doesn’t do zombie reaping, which may or may not become a problem down the road.
Luckily there exists a solution to this problem: `tini`.
`tini` is a super minimalistic init system, specifically written for Docker containers, and as such a perfect fit for the job.
Let’s put it to work, shall we?
`tini` in Action
To use `tini` we need to include it in our Docker image and set it as the `ENTRYPOINT` of our container:
```diff
 FROM ruby:2.6

+RUN apt-get update -qq \
+ && apt-get install -qq --no-install-recommends \
+      tini \
+ && apt-get clean \
+ && rm -rf /var/lib/apt/lists/*

 # Copy app, install gems etc ...

-ENTRYPOINT ["./docker-entrypoint.sh"]
+ENTRYPOINT ["tini", "--", "./docker-entrypoint.sh"]
```
Note the usage of `[...]` in the `ENTRYPOINT`; it’s important, see this article for the why.
And that’s it. Nothing more, nothing less.
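As a side note: if you’d rather not bake `tini` into the image, reasonably recent Docker versions ship their own copy and can inject it at runtime via the `--init` flag (whether your orchestrator exposes that option is another question):

```bash
# --init runs Docker's bundled tini as PID1 in front of the image's ENTRYPOINT
docker run --init my-container worker
```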
What We’ve Learned
With these changes, our app becomes a well-behaved citizen in Docker City. No more lost jobs, no more missing data.
Let’s revisit what we’ve learned:
- whatever you put into your Docker `ENTRYPOINT` runs as PID1
- PID1 is special
- `bash` does not forward signals to its children before dying
- use `exec` to let your app replace `bash`
- let `tini` be a good PID1 citizen