Kubernetes, initially released in 2014, is an open-source container orchestration system licensed under the Apache License 2.0 and written in Go. Google originally created it, but today it’s maintained by the Cloud Native Computing Foundation (CNCF). Kubernetes offers incredible flexibility, allowing organizations to deploy, scale, and manage production-grade containerized workloads easily. That flexibility comes with considerable complexity and many unfamiliar terms and technologies for teams to learn. Liveness probes and readiness probes are two terms you should know as you deploy mature applications to Kubernetes. In this guide, we will discuss using a liveness probe in Kubernetes clusters as a health check that determines whether a container is still running and responsive.
You want your applications to be reliable, but sometimes... they’re not. They may fail because of configuration errors, application errors, or temporary connection issues. Although the reason an application became unreliable is important, it’s equally important to know that an issue has occurred or is occurring. Probes help developers troubleshoot by monitoring their applications for issues, and they can also help them plan and manage resources by indicating when an application is experiencing resource contention.
Probes are periodic checks that monitor the health of an application. They are typically defined in a YAML deployment template, which you can apply or modify with the kubectl command-line client. The three types of probes in K8s are:
Startup probes: These probes verify whether an application within a container has started. A startup probe runs only during startup, and liveness and readiness checks are disabled until it succeeds. If a container fails this probe, the container is killed and handled according to the pod’s restartPolicy. You can configure startup probes in the spec.containers.startupProbe attribute of the pod configuration. A primary motivation for startup probes is that some legacy applications need additional time when first initialized, which can make setting liveness probe parameters tricky. When configuring a startupProbe, use the same protocol as the application and make sure failureThreshold * periodSeconds is long enough to cover the worst-case startup time.
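As a sketch of that sizing rule, the following Pod pairs a startup probe with a liveness probe on the same HTTP endpoint. The Pod and container names, the image, port 8080, and the /healthz path are hypothetical. With failureThreshold: 30 and periodSeconds: 10, the application gets up to 30 × 10 = 300 seconds to start before the kubelet kills it.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: legacy-app                      # hypothetical name
spec:
  containers:
  - name: legacy-app
    image: example.com/legacy-app:1.0   # hypothetical image
    ports:
    - containerPort: 8080
    # Startup probe: allows up to failureThreshold * periodSeconds
    # = 30 * 10 = 300 seconds for the application to start.
    startupProbe:
      httpGet:
        path: /healthz                  # hypothetical health endpoint
        port: 8080
      failureThreshold: 30
      periodSeconds: 10
    # Liveness probe: same protocol and endpoint, but it only starts
    # running after the startup probe has succeeded.
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      failureThreshold: 3
      periodSeconds: 10
```

Using the same endpoint for both probes keeps the startup and liveness definitions of "healthy" consistent, while the larger failureThreshold applies only during startup.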
Readiness probes: These probes continuously verify whether a container is ready to serve requests. If the probe fails, Kubernetes removes the pod’s IP address from the endpoints of all services that match the pod. Readiness probes let developers tell Kubernetes that a running container should not receive traffic until additional tasks are completed, such as loading files, warming caches, and establishing network connections. You can configure a readiness probe in the spec.containers.readinessProbe attribute of the pod configuration. These probes run periodically, as defined by the periodSeconds field, for the container’s entire lifetime.
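A minimal readiness probe might look like the following sketch. The names, image, port 8080, /ready path, and timing values are illustrative assumptions. The important behavior is in the comments: a failing readiness probe takes the pod out of service rotation without restarting the container.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web                             # hypothetical name
spec:
  containers:
  - name: web
    image: example.com/web:1.0          # hypothetical image
    ports:
    - containerPort: 8080
    # While this probe fails, the pod's IP is removed from the
    # endpoints of matching Services, so it receives no traffic.
    # The container itself is NOT restarted.
    readinessProbe:
      httpGet:
        path: /ready                    # hypothetical readiness endpoint
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
      failureThreshold: 3
```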
Liveness probes: These probes help you evaluate whether an application running in a container is in a healthy state. If not, the kubelet kills the container and restarts it according to the pod’s restartPolicy. They are useful when you want to ensure your application is not deadlocked or silently unresponsive. You can configure liveness probes in the spec.containers.livenessProbe attribute of the pod configuration. Like readiness probes, liveness probes run periodically. We will look at their details and configuration options below.
By design, Kubernetes automatically monitors pods throughout their lifecycle and restarts containers when it detects that their main process, Process ID 1 (PID 1) inside the container, has failed. That works well when your application crashes, because its process exits with a non-zero exit code that Kubernetes can detect. Unfortunately, not all failures are crashes, and Kubernetes doesn’t always detect them. For example, if your application loses its database connection, or encounters timeouts when connecting to a third-party service, it might not recover on its own. In cases like this, the pods appear to the kubelet to be running as expected, but end users will be unable to access the application.
These types of issues can be difficult to debug because at the container level everything is operating as expected. Liveness probes solve this problem because they communicate information about the internal states of your pods to Kubernetes, which means that your cluster will handle the problem instead of requiring manual monitoring and intervention. Liveness probes reduce your maintenance burden and make certain that your application is not silently failing.
Below are details on the available types of liveness probes in Kubernetes. Selecting the type of liveness probe that most closely aligns with your application’s architecture and accurately exposes the internal state of your application is critical for successful workloads deployed to Kubernetes.
1. Command execution liveness probe: This probe runs a command or script inside the container. If the command exits with code 0, the probe succeeds and the container is considered healthy; any non-zero exit code counts as a failure.
2. HTTP GET liveness probe: This probe sends an HTTP GET request to a URL in the container. If the container’s response includes an HTTP status code in the 200-399 range, it means the probe was successful.
You can set these additional fields on httpGet for your HTTP probe:
host: The host name to connect to. It defaults to the pod IP; you probably want to set "Host" in httpHeaders instead.
scheme: The scheme to use for connecting to the host (HTTP or HTTPS); it defaults to HTTP.
path: The path to access on the HTTP server; it defaults to /.
httpHeaders: The custom headers you can set in the request; HTTP allows repeated headers.
port: The name or number of the port to access on the container; this number must be in the 1 to 65535 range.
3. TCP Socket liveness probe: This probe attempts to connect to a specific TCP port inside the container. If the specified port is open, the probe is considered successful.
4. gRPC liveness probe: Applications that use gRPC can use gRPC health-check probes, which were introduced in Kubernetes v1.23. If your application implements the gRPC Health Checking Protocol, you can configure the kubelet to use it for liveness checks. In v1.23, where the feature was alpha, you need to enable the GRPCContainerProbe feature gate; it is enabled by default from v1.24, and the feature became stable in v1.27. You must configure the port to use a gRPC probe, and if the health endpoint is configured on a non-default service, you also need to specify the service.
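The HTTP, TCP, and gRPC variants differ only in the handler block under livenessProbe. The sketch below shows one container per type. The Pod name, images, ports, path, header values, and gRPC service name are hypothetical placeholders. The ports are distinct because containers in a pod share a network namespace.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-types-demo                # hypothetical name
spec:
  containers:
  # HTTP GET probe: succeeds if the response status is 200-399.
  - name: http-app
    image: example.com/http-app:1.0     # hypothetical image
    livenessProbe:
      httpGet:
        scheme: HTTP                    # default; use HTTPS for TLS endpoints
        path: /healthz                  # hypothetical path (default is /)
        port: 8080
        httpHeaders:
        - name: Host                    # preferred over the host field
          value: app.example.com        # hypothetical host name
        - name: X-Probe                 # hypothetical custom header
          value: liveness
      periodSeconds: 10
  # TCP socket probe: succeeds if a TCP connection can be opened.
  - name: tcp-app
    image: example.com/tcp-app:1.0      # hypothetical image
    livenessProbe:
      tcpSocket:
        port: 6379
      periodSeconds: 10
  # gRPC probe: the app must implement the gRPC Health Checking Protocol.
  - name: grpc-app
    image: example.com/grpc-app:1.0     # hypothetical image
    livenessProbe:
      grpc:
        port: 9090                      # must be a number, not a port name
        service: liveness               # optional; hypothetical service name
      periodSeconds: 10
```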
In production environments, including liveness probes as a part of your application deployment templates is considered a best practice. This way you can template and reuse your liveness probe configuration across similar applications.
When getting started, it’s best to test and demonstrate your liveness probe configuration with an application similar to the one you plan to run in production. To illustrate, we will take the example image registry.k8s.io/busybox and deploy a single pod with a liveness probe that uses the command execution method.
Applying this YAML in your cluster deploys an example pod that stays healthy for the first 40 seconds and then intentionally enters a failed state. At that point the liveness probe fails, and the kubelet restarts the container to restore service.
- name: liveness
  image: registry.k8s.io/busybox
  args: ["/bin/sh", "-c",
    "touch /tmp/healthz; sleep 40; rm -f /tmp/healthz; sleep 700"]
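Only part of the manifest survives above. A complete Pod manifest consistent with this walkthrough might look like the following sketch. It keeps the busybox image, the container name, the shell command, and the cat /tmp/healthz check. The Pod name, the label, and the initialDelaySeconds and periodSeconds values are assumptions, not the original values.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: liveness-exec                   # assumed name
  labels:
    test: liveness                      # assumed label
spec:
  containers:
  - name: liveness
    image: registry.k8s.io/busybox
    args:
    - /bin/sh
    - -c
    # Healthy for 40 seconds, then delete the file the probe checks.
    - touch /tmp/healthz; sleep 40; rm -f /tmp/healthz; sleep 700
    livenessProbe:
      exec:
        # Exits 0 while /tmp/healthz exists and non-zero after it is removed.
        command:
        - cat
        - /tmp/healthz
      initialDelaySeconds: 5            # assumed value
      periodSeconds: 5                  # assumed value
```

You could save this as liveness-exec.yaml (a hypothetical filename) and apply it with kubectl apply -f liveness-exec.yaml. You can then watch the failed probes in the events shown by kubectl describe pod liveness-exec, and the growing restart count in kubectl get pod liveness-exec.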
In the example above, the Pod has a single container. The fields and commands under the livenessProbe attribute specify how you want the kubelet to perform the health checks:
initialDelaySeconds: this field indicates how long the kubelet waits, in seconds, before performing the first probe; the default is 0 seconds, and the minimum value is 0. Keep it as low as possible while still giving the application time to start; for slow-starting applications, a startup probe is usually a better fit.
periodSeconds: this field specifies how often the kubelet performs a liveness probe in seconds; the default is 10 seconds, and the minimum value is 1.
timeoutSeconds: the number of seconds after which the probe times out; the default is 1 second and the minimum value is 1.
successThreshold: after a probe previously failed, this is the minimum number of consecutive successes required for a probe to be considered successful; the default is 1, the minimum value is 1, and it must be 1 for liveness and startup probes.
failureThreshold: after a probe fails this many times in a row, Kubernetes considers the check failed. For startup and liveness probes, it treats the container as unhealthy and triggers a restart of that container; the default is 3 and the minimum value is 1. When restarting the container, the kubelet honors its terminationGracePeriodSeconds setting.
terminationGracePeriodSeconds: how long the kubelet waits between triggering the shutdown of a failed container and forcing the container runtime to stop that container. By default, it inherits the Pod-level value of terminationGracePeriodSeconds, which is 30 seconds if not specified; the minimum value is 1.
exec: the kubelet executes the command cat /tmp/healthz in the target container to perform the probe. For the first 40 seconds of the container’s life, the file exists and the command returns a success code. After the file is deleted, the command returns a failure code.
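For reference, here is a livenessProbe with each timing field set explicitly to its documented default. terminationGracePeriodSeconds is left out so that it inherits the Pod-level value. The exec command reuses the example’s cat /tmp/healthz check.

```yaml
# Every timing field below is written out at its default value.
livenessProbe:
  exec:
    command:
    - cat
    - /tmp/healthz
  initialDelaySeconds: 0
  periodSeconds: 10
  timeoutSeconds: 1
  successThreshold: 1       # must be 1 for liveness probes
  failureThreshold: 3
```

With these defaults, three consecutive failed probes 10 seconds apart trigger a restart. A container that stops responding is therefore typically restarted within roughly failureThreshold × periodSeconds = 3 × 10 = 30 seconds.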
For HTTP and TCP probes, you can use a named port, for example port: http. Note that gRPC probes do not support named ports or custom hosts.
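A named port is declared in the container’s ports list and then referenced by name from the probe. In this container fragment, the container name, image, port number, and path are hypothetical.

```yaml
containers:
- name: web
  image: example.com/web:1.0            # hypothetical image
  ports:
  - name: http                          # the name the probe refers to
    containerPort: 8080
  livenessProbe:
    httpGet:
      path: /healthz                    # hypothetical path
      port: http                        # resolved to containerPort 8080
```

Referencing the port by name means that if the container port number changes, only the ports entry needs to be updated.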
Successful liveness probes don’t impact the health of your cluster. The probed container keeps running, and a new probe is scheduled after the periodSeconds delay. If probes run too frequently, though, they waste resources and can hurt application performance. If they aren’t frequent enough, on the other hand, your containers may run in an unhealthy state for extended periods before the problem is addressed.
Use the fields and commands outlined above to fine-tune your probes for your application. Once you know how long your liveness probe’s command, API request, or gRPC call takes to complete, use that value for timeoutSeconds, and consider adding a small buffer. Use the smallest values you can for simple, short-running probes. Intensive or long-running commands may require a longer interval between runs, which means you will not have the most up-to-date view of your containers’ health.
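For example, suppose you measured that a check takes up to about 2 seconds. A probe sized with a small buffer might look like this sketch. The script /usr/local/bin/check-health and all of the numbers are illustrative assumptions, not recommendations.

```yaml
livenessProbe:
  exec:
    command:
    - /usr/local/bin/check-health       # hypothetical script, ~2 s worst case
  timeoutSeconds: 3                     # measured worst case (~2 s) plus a 1 s buffer
  periodSeconds: 15                     # run a slower check less often
  failureThreshold: 3
```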
Also ensure that the target of the probe command or HTTP request is independent of your main application. This ensures that it can report its status to kubelet even if your primary application fails. If your liveness probe is served by your standard application entry point, it could lead to inaccurate results if its framework fails or if it requests an external dependency that is unavailable.
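One way to keep the probe target separate from the main entry point is to point the probe at a lightweight health listener. This sketch assumes your application exposes such a listener on its own port. Port 8081, the /livez path, the container name, and the image are hypothetical, and the application must actually serve that endpoint without calling external dependencies.

```yaml
containers:
- name: api
  image: example.com/api:1.0            # hypothetical image
  ports:
  - containerPort: 8080                 # main application traffic
  - containerPort: 8081                 # hypothetical dedicated health listener
  livenessProbe:
    httpGet:
      path: /livez                      # hypothetical; should not call external dependencies
      port: 8081
```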
Probes are affected by restart policies: Restart policies are applied after probes. Your Pod must use restartPolicy: Always (the default) or restartPolicy: OnFailure so that Kubernetes can restart containers after a failed liveness probe. If you use the Never policy, a container whose liveness probe fails is killed and remains in a failed state indefinitely instead of being restarted.
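restartPolicy is a Pod-level field that applies to the Pod’s containers. In this minimal sketch, the Pod name, image, and port 8080 are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web                             # hypothetical name
spec:
  # Always (the default) or OnFailure lets the kubelet restart a
  # container after its liveness probe fails; Never leaves it terminated.
  restartPolicy: Always
  containers:
  - name: web
    image: example.com/web:1.0          # hypothetical image
    livenessProbe:
      tcpSocket:
        port: 8080
```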
Not every container needs a probe: You can skip probes for simple containers that always exit on failure, and for low-priority services.
Jobs whose work doesn’t run as PID 1 are a great example of a workload that benefits from a liveness probe: if the worker hangs while PID 1 keeps running, Kubernetes cannot detect the failure on its own. If the job fails, we want it restarted to ensure the most recent run is successful and that work is not silently queuing up.
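A Job’s pod template only accepts restartPolicy: OnFailure or Never. OnFailure is the setting that lets a failed liveness probe restart the worker container in place. In this sketch, the Job name, image, and the /tmp/healthz marker file are hypothetical. The worker is assumed to remove that file when it can no longer make progress.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-worker                    # hypothetical name
spec:
  template:
    spec:
      # Job pods may only use OnFailure or Never; OnFailure lets the
      # kubelet restart the container when the liveness probe fails.
      restartPolicy: OnFailure
      containers:
      - name: worker
        image: example.com/batch-worker:1.0   # hypothetical image
        livenessProbe:
          exec:
            # Hypothetical marker file the worker is assumed to remove
            # when it detects that it is stuck.
            command:
            - cat
            - /tmp/healthz
          periodSeconds: 30
          failureThreshold: 3
```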
Reevaluate your probes: Don’t set it and forget it. New optimizations, features, and regressions in your application can impact your probe performance. Make sure you check your probes regularly and make adjustments as needed. The most accurate probes come from settings derived from real world metrics. Revisiting your probes after deployment allows you to adjust your probes according to historic performance.
Using liveness probes in Kube can help you improve your application availability because they give you ongoing insight into the health of the applications inside your containers. In some ways, Kubernetes can create a disconnect, because while your pods may appear healthy, your users may not actually be able to access your apps and services. Liveness probes help you verify that your applications, containers, and pods are all running as designed and ensure that K8s is restarting containers when they become unhealthy.
You can use open source tools such as Polaris to automatically audit your YAML manifests and fix the issues they find. For liveness and readiness probes, Polaris may leave comments prompting users to make changes appropriate to their application’s context. Here is a video of me walking through some of these basic examples of setting liveness probes across clusters to ensure reliability using Fairwinds Insights, which you can use to get started.
Different types of health checks can help you perform liveness checks and readiness checks in Kubernetes. Liveness probes, readiness probes, and startup probes can all help you make sure that your Kubernetes services are built on a good foundation so your DevOps teams can deliver better reliability and higher uptime of your apps and services. If you’re having trouble getting started, check out this tutorial on the Kubernetes website.