Configure a MachineHealthCheck

Overview

A MachineHealthCheck is a Cluster API resource that lets users define the conditions under which Machines in a cluster are considered unhealthy. A MachineHealthCheck is defined on the management cluster and scoped to a particular workload cluster.

When defining a MachineHealthCheck, users specify a timeout for each condition to be checked on the Machine's Node. If any of these conditions holds for the full duration of its timeout, the Machine is remediated. By default, remediating a Machine triggers the creation of a new Machine to replace the failed one, but providers can plug in more sophisticated external remediation solutions.
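
As a rough sketch of how external remediation is wired in, a MachineHealthCheck can point to a provider-supplied template through spec.remediationTemplate. The template's apiVersion, kind, and name below are hypothetical placeholders; the actual values depend on the infrastructure provider in use and are not defined by this guide.

```yaml
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: <machinehealthcheck-name>
  namespace: cpaas-system
spec:
  clusterName: <cluster-name>
  selector:
    matchExpressions:
    - key: cluster.x-k8s.io/deployment-name
      operator: Exists
  unhealthyConditions:
  - type: Ready
    status: Unknown
    timeout: 600s
  # Hand remediation to a provider-specific template instead of the
  # default replace-the-Machine behavior. All three values below are
  # hypothetical placeholders.
  remediationTemplate:
    apiVersion: <provider-api-group>/<version>
    kind: <ProviderRemediationTemplate>
    name: <remediation-template-name>
```

When remediationTemplate is omitted, as in the examples in the rest of this page, the default replacement behavior applies.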

WARNING

MachineHealthCheck relies on Cluster API's rolling update mechanism. During a rolling update, any previously attached disks are removed and replaced with new disks on newly created machines. Ensure that no cluster functionality or workloads depend on data stored on the original disks.

Prerequisites

Before attempting to configure a MachineHealthCheck, you should have a working management cluster with at least one MachineDeployment or KubeadmControlPlane deployed.

Configure a MachineHealthCheck for a MachineDeployment

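To configure a MachineHealthCheck for a MachineDeployment, create a MachineHealthCheck resource in the management cluster. The following example selects every Machine that belongs to a MachineDeployment in the cluster. A Machine is considered unhealthy when its Node's Ready condition has been Unknown or False for 600 seconds, or when its Node has not started within the 30-minute nodeStartupTimeout.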

---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: <machinehealthcheck-name>
  namespace: cpaas-system
spec:
  clusterName: <cluster-name>
  nodeStartupTimeout: 30m
  selector:
    matchExpressions:
    - key: cluster.x-k8s.io/deployment-name
      operator: Exists
  unhealthyConditions:
  - type: Ready
    status: Unknown
    timeout: 600s
  - type: Ready
    status: "False"
    timeout: 600s
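
The selector above matches Machines from every MachineDeployment in the cluster. As a minimal sketch of a narrower scope, the variant below targets a single MachineDeployment and adds maxUnhealthy so that remediation pauses when too many Machines fail at the same time. The <machinedeployment-name> placeholder and the 40% threshold are illustrative assumptions, not values prescribed by this guide.

```yaml
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: <machinehealthcheck-name>
  namespace: cpaas-system
spec:
  clusterName: <cluster-name>
  nodeStartupTimeout: 30m
  # Remediate only while at most 40% of the selected Machines are
  # unhealthy (illustrative threshold; an integer count is also accepted).
  maxUnhealthy: 40%
  selector:
    matchLabels:
      # Match Machines owned by one specific MachineDeployment.
      cluster.x-k8s.io/deployment-name: <machinedeployment-name>
  unhealthyConditions:
  - type: Ready
    status: Unknown
    timeout: 600s
  - type: Ready
    status: "False"
    timeout: 600s
```

This works because Cluster API sets the cluster.x-k8s.io/deployment-name label on each Machine to the name of its owning MachineDeployment, so matching on the label value limits the check to that deployment.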

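Configure a MachineHealthCheck for control plane nodes

To configure a MachineHealthCheck for control plane nodes, create a MachineHealthCheck resource in the management cluster that selects Machines carrying the cluster.x-k8s.io/control-plane label. Setting maxUnhealthy to 1 allows remediation only while at most one selected Machine is unhealthy, which guards against replacing several control plane nodes at once.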

---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: <control-plane-machinehealthcheck-name>
  namespace: cpaas-system
spec:
  clusterName: <cluster-name>
  maxUnhealthy: 1
  selector:
    matchLabels:
      cluster.x-k8s.io/control-plane: ""
  unhealthyConditions:
  - type: Ready
    status: Unknown
    timeout: 600s
  - type: Ready
    status: "False"
    timeout: 600s

Reference

Configure a MachineHealthCheck