Graceful Shutdown

2021-09-01 987 words 5 minutes

TL;DR: This article discusses the importance of Graceful Shutdown in microservices, the shutdown signals (SIGTERM, SIGINT and SIGKILL) and three perspectives on this challenge: Go, Kubernetes, and Istio.

Introduction

Horizontal scalability is one of the key advantages of modern software architectures: it keeps systems responsive under varying loads.

In other words, replicas of microservices are scaled up when the load increases and scaled down when the load decreases.

For this elasticity (increase and decrease of replicas) to be seamless and transparent, it is important for the initialization time of microservices to be low. The faster a service initializes, the quicker it becomes available to handle requests and share the load with other replicas.

Question

What about the shutdown of microservices?

What happens if a replica receives a shutdown signal while it’s processing requests?

Some situations where a replica may receive a shutdown signal include:

load decrease
rolling update
rolling restart

Normally, when the replica receives the signal, processing of its in-flight requests is interrupted and clients receive errors.

Unless the service has a smarter shutdown process: graceful shutdown.

Understanding the signals

Signals are primarily used in Unix-like systems and are sent by the kernel or some other program.

The main shutdown/termination signals for a program are SIGTERM, SIGINT and SIGKILL.

SIGTERM is a generic signal used to cause program termination. It is the signal the kill command sends by default.

Info

Kubernetes sends the SIGTERM signal to the containers of a Pod when it terminates that Pod.

The SIGINT signal is sent when the user presses Ctrl-C in the terminal.

SIGKILL is used to immediately terminate a program. It cannot be intercepted or ignored and is thus always fatal.

Graceful shutdown

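Graceful shutdown refers to a controlled termination process: the service stops taking new work, finishes the work in progress, and releases its resources before exiting, so that neither clients nor the system are harmed.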

To enable graceful shutdown, microservices should effectively handle the SIGTERM and SIGINT signals mentioned above.

The default behavior of most technologies is to abruptly stop program processing, which often negatively impacts functionality.

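For example, in Go, a SIGHUP, SIGINT, or SIGTERM signal causes the program to exit by default, while a synchronous signal (such as SIGSEGV) is converted into a run-time panic.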

A simple way to handle these signals is to wait a few seconds for processing to complete. However, it may be necessary to close connections to databases, Redis, or a message broker, for example.

Graceful shutdown can be implemented directly in the service’s code. However, Kubernetes and Istio have configurations that can help in this task.

Go

The most common implementation of graceful shutdown in Go is using Goroutines and Channels, as shown in the example below.

In this example, the HTTP server is initialized in a new Goroutine while the main Goroutine waits for a signal on the quit channel.

Once a signal is received, the server is shut down with a timeout of 5 seconds: it stops accepting new connections and waits for in-flight requests to finish. If there are still active connections after 5 seconds, the Shutdown() function returns the context's error.

Note

The Shutdown() function was introduced in Go 1.8.

The main web frameworks in Go suggest implementations following this pattern with Goroutines and Channels.

For those who prefer to use third-party libraries specifically created for this purpose, I would recommend ory/graceful.

Tip

An adaptation of the above example using the ory/graceful library is available on my GitHub.

Kubernetes

In Kubernetes Pods, it is possible to configure a hook called preStop, which is invoked before the SIGTERM signal is sent.

By setting a sleep command in this hook, we can achieve graceful shutdown, as shown in the example below.

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - name: web
      image: nginx
      ports:
        - name: web
          containerPort: 80
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "15"]

The sleep interval should be sufficient for the Kubernetes endpoint change to propagate to kube-proxy, Ingress Controller, CoreDNS, etc. For more details, refer to this article.

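By default, Kubernetes waits up to 30 seconds (the Pod's terminationGracePeriodSeconds) for the Pod to shut down before forcefully terminating the process with SIGKILL, which cannot be intercepted. This countdown includes the time spent in the preStop hook, so the sleep and the application's own shutdown must both fit within the grace period.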

Tip

I strongly recommend reading this article for more details on Graceful Shutdown in Kubernetes.

Warning

The main disadvantage of this approach is that the container image needs to include the sleep command, which makes it difficult to use Distroless images.

Istio

Istio has the configuration terminationDrainDuration, which defines how long the sidecar drains connections before it shuts down.

Info

Sidecar is a common concept in Service Mesh implementations. It is a container that accompanies the application (which is also a container) within the Kubernetes Pod. Thus, we have 2 containers inside the Pod: app + sidecar.

The sidecar is a proxy (in the case of Istio, it is Envoy) that intermediates all Pod traffic, providing all the advantages of the Service Mesh.

When the proxy receives SIGTERM or SIGINT, it starts draining connections, preventing new connections and allowing existing connections to complete.

Tip

Remember that SIGTERM is sent after the execution of the preStop hook.

The duration of this draining process is configurable both globally:

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    defaultConfig:
      terminationDrainDuration: 50s

and per workload:

annotations:
  proxy.istio.io/config: '{ "terminationDrainDuration": 50s }'

The default duration is 5 seconds.

Conclusion

The best approach to enable Graceful Shutdown depends on the scenario of each project/system.

There are projects where:

modifying the microservices’ code is very laborious;
Service Mesh is not used;
the use of Distroless images is a priority;
the services are not running on Kubernetes;
the services are running on Kubernetes with Istio, and the team has the agility to modify the microservices’ code. In this case, it is possible to combine multiple strategies to ensure zero downtime.

Question

Share in the comments the challenges and lessons learned from your project! 😉

What is the scenario of your project?
What approach is used for graceful shutdown?
How is the implementation in your preferred programming language and framework?
