Graceful Shutdown

TL;DR:
This article discusses the importance of Graceful Shutdown in microservices, the shutdown signals (SIGTERM, SIGINT, and SIGKILL), and three perspectives on this challenge: Go, Kubernetes, and Istio.
Introduction
Horizontal scalability is one of the key advantages of modern software architectures: it keeps systems responsive under varying loads.
In other words, replicas of microservices are scaled up when the load increases and scaled down when the load decreases.
For this elasticity (increase and decrease of replicas) to be seamless and transparent, it is important for the initialization time of microservices to be low. The faster a service initializes, the quicker it becomes available to handle requests and share the load with other replicas.
What about the shutdown of microservices?
What happens if a replica receives a shutdown signal while it’s processing requests?
Some situations where a replica may receive a shutdown signal include:
- load decrease
- rolling update
- rolling restart
Normally, upon receiving a shutdown signal, the replica would interrupt the requests it is processing, and clients would receive errors, unless the service has a smarter shutdown process: graceful shutdown.
Understanding the signals
Signals are primarily used in Unix-like systems and are sent by the kernel or some other program.
The main shutdown/termination signals for a program are SIGTERM, SIGINT, and SIGKILL.
SIGTERM is a generic signal used to cause program termination.
It is the signal that the kill command sends by default.
Kubernetes sends the SIGTERM signal to kill a Pod.
The SIGINT signal is sent when the user types CTRL-c.
SIGKILL is used to immediately terminate a program.
It cannot be intercepted or ignored and is thus always fatal.
Graceful shutdown
Graceful shutdown refers to a controlled and seamless termination process that avoids system harm.
To enable graceful shutdown, microservices should effectively handle the SIGTERM and SIGINT signals mentioned above.
The default behavior of most technologies is to abruptly stop program processing, which often negatively impacts functionality.
For example, in Go, SIGTERM and SIGINT cause the program to exit by default, while a synchronous signal such as SIGSEGV is converted into a runtime panic.
A simple way to handle these signals is to wait a few seconds for processing to complete. However, it may be necessary to close connections to databases, Redis, or a message broker, for example.
Graceful shutdown can be implemented directly in the service’s code. However, Kubernetes and Istio have configurations that can help in this task.
Go
The most common implementation of graceful shutdown in Go uses Goroutines and Channels.
In a typical implementation, the HTTP server is initialized in a new Goroutine while the main Goroutine waits for a signal on a quit channel.
Once a signal is received, the server is shut down with a timeout, for example 5 seconds.
If there are still active connections when the timeout expires, the Shutdown() function returns an error.
The Shutdown() function was introduced in Go 1.8.
The main web frameworks in Go suggest implementations following this pattern with Goroutines and Channels.
For those who prefer to use third-party libraries specifically created for this purpose, I would recommend ory/graceful.
Kubernetes
In Kubernetes Pods, it is possible to configure a hook called preStop, which is invoked before the SIGTERM signal is sent.
By setting a sleep command in this hook, we can achieve graceful shutdown.
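A Pod manifest with such a hook might look like the fragment below. The Pod name, the image, and the 10-second sleep are placeholder values chosen for this illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app                        # hypothetical name
spec:
  terminationGracePeriodSeconds: 30   # Kubernetes default, shown for clarity
  containers:
    - name: app
      image: my-app:1.0               # hypothetical image; it must contain a sleep binary
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "10"]  # illustrative duration
```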
The sleep interval should be sufficient for the Kubernetes endpoint change to propagate to kube-proxy, Ingress Controller, CoreDNS, etc. For more details, refer to this article.
By default, Kubernetes waits for up to 30 seconds during the Pod shutdown process before forcefully terminating the process (SIGKILL, which cannot be intercepted).
This approach requires the container image to include the sleep command, making it difficult to use Distroless images.
Istio
Istio has the configuration TerminationDrainDuration, which allows defining a pause before shutting down the sidecar.
Sidecar is a common concept in Service Mesh implementations.
It is a container that accompanies the application (which is also a container) within the Kubernetes Pod.
Thus, we have 2 containers inside the Pod: app + sidecar.
The sidecar is a proxy (in the case of Istio, it is Envoy) that intermediates all Pod traffic, providing all the advantages of the Service Mesh.
When the proxy receives SIGTERM or SIGINT, it starts draining connections, preventing new connections and allowing existing connections to complete.
SIGTERM is sent after the execution of the preStop hook.
The duration of this draining process is configurable both globally and per workload.
The default duration is 5 seconds.
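As a sketch of the two options, the fragments below set the drain duration to 30 seconds, an arbitrary value chosen for illustration. The global fragment assumes Istio is installed through an IstioOperator resource; other installation methods expose the same mesh configuration differently.

```yaml
# Global: applies to every sidecar in the mesh
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    defaultConfig:
      terminationDrainDuration: 30s
```

The per-workload fragment uses the proxy.istio.io/config annotation on the Pod template, shown here inside a Deployment:

```yaml
# Per workload: overrides the global value for this Deployment's Pods
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    metadata:
      annotations:
        proxy.istio.io/config: |
          terminationDrainDuration: 30s
```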
Conclusion
The best approach to enable Graceful Shutdown depends on the scenario of each project/system.
There are projects where:
- modifying the microservices’ code is very laborious;
- Service Mesh is not used;
- the use of Distroless images is a priority;
- the services are not running on Kubernetes;
- the services are running on Kubernetes with Istio, and the team has the agility to modify the microservices’ code. In this case, it is possible to combine multiple strategies to ensure zero downtime.
Share in the comments the challenges and lessons learned from your project! 😉
- What is the scenario of your project?
- What approach is used for graceful shutdown?
- How is the implementation in your preferred programming language and framework?