2025-12-30

k8s

An Introduction to Kubernetes Workloads

Pods
Workload Management

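A workload is an application running on Kubernetes. Whether your workload is a single component or several that work together, you run it inside a set of Pods. In Kubernetes, a Pod represents a set of running containers on your cluster.

The term can be confusing at first, so here is how I came to understand it: a workload is a logical concept. It is simply the application you actually run on Kubernetes, not a concrete resource type; Kubernetes models it as the containers inside Pods. To deploy and manage a workload, Kubernetes provides workload resources, and what those resources actually manage is the workload's abstraction: Pods.

Why does Kubernetes have two kinds of abstraction, Pods and workload resources? The reason is simple.

Kubernetes Pods have a defined lifecycle. For example, once a Pod is running in your cluster, a critical fault on the node where that Pod runs means that all the Pods on that node fail. Kubernetes treats that level of failure as final: you need to create a new Pod to recover, even if the node later becomes healthy again.

However, to make life considerably easier, you don't need to manage each Pod directly. Instead, you can use workload resources that manage a set of Pods on your behalf. These resources configure controllers that make sure the right number of the right kind of Pod are running, to match the state you specified.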
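To make the controller idea concrete, here is a minimal Go sketch of one pass of a reconcile loop. It is not Kubernetes code: the `pod` type and the `reconcile` function are hypothetical and heavily simplified, and only illustrate the principle that a controller compares desired state with observed state and replaces failed Pods rather than repairing them.

```go
package main

import "fmt"

// pod is a hypothetical, heavily simplified stand-in for a Kubernetes Pod.
type pod struct {
	name    string
	healthy bool
}

// reconcile compares the desired replica count with the Pods that currently
// exist and returns how many Pods to create and which ones to delete.
// Unhealthy Pods are never repaired; they are deleted and replaced.
func reconcile(desired int, current []pod) (toCreate int, toDelete []string) {
	healthy := 0
	for _, p := range current {
		switch {
		case !p.healthy:
			// Failed Pods are removed and replaced, not fixed in place.
			toDelete = append(toDelete, p.name)
		case healthy < desired:
			healthy++
		default:
			// Surplus healthy Pods are scaled down.
			toDelete = append(toDelete, p.name)
		}
	}
	return desired - healthy, toDelete
}

func main() {
	current := []pod{{"web-a", true}, {"web-b", false}}
	create, del := reconcile(3, current)
	fmt.Println(create, del) // prints: 2 [web-b]
}
```

A real controller runs this comparison continuously against the API server and also handles rollouts, ownership and back-off, none of which is modeled here.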

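This article therefore covers workloads from two angles:

Pods: the smallest and simplest unit that can be created and deployed; the object Kubernetes actually uses to model a workload.
Workload Resources: conceptually, they deploy and manage your workload; concretely, they manage Pods.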

Pods

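A Pod is the smallest and simplest unit that can be created and deployed in Kubernetes. A Pod represents processes running on the cluster. It encapsulates one or more application containers (like peas in a pod). Running several containers in one Pod is generally reserved for applications that must cooperate closely, such as the many applications built on the sidecar pattern.

The characteristics of a Pod can be summarized as follows:

A Pod is the atomic unit of scheduling and management in Kubernetes.
A Pod can contain one or more tightly coupled containers that share the network and IPC namespaces as well as storage volumes. In other words, all containers in the same Pod share the namespaces for those resources, which isolate the Pod's resources from the host's. For more on namespaces, see the article "Containerization Technology: Linux Namespace".
All containers in a Pod live and die together: when the Pod is deleted, all of its containers are terminated.

The shared context of a Pod is a set of Linux namespaces, control groups (cgroups), and potentially other facets of isolation: the same technologies that isolate a container. Within a Pod's context, the individual applications may have further sub-isolations applied.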

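Pods in a Kubernetes cluster are used in two main ways:

Pods that run a single container.
The "one-container-per-Pod" model is the most common Kubernetes use case. In this case, you can think of a Pod as a wrapper around a single container; Kubernetes manages Pods rather than managing the containers directly.
Pods that run multiple containers that need to work together.
A Pod can encapsulate an application composed of multiple co-located containers that are tightly coupled and need to share resources. These co-located containers form a single cohesive unit.

Grouping multiple co-located and co-managed containers in a single Pod is a relatively advanced use case. You should use this pattern only in specific instances in which your containers are tightly coupled and must share resources or a lifecycle.

For example, you might have a container that acts as a web server for files in a shared volume, and a separate sidecar container that updates those files from a remote source (illustrated by a figure in the original post).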

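Using Pods

You will rarely create individual Pods directly in Kubernetes, even singleton Pods that run a single container. This is because Pods are designed as relatively ephemeral, disposable entities. When a Pod gets created (directly by you, or indirectly by a controller), Kubernetes schedules the new Pod onto a Node in the cluster. The Pod remains on that Node until one of the following happens:

The Pod finishes execution (its containers terminate)
The Pod object is deleted
The Pod is evicted for lack of node resources
The Node itself fails

Note: do not confuse restarting a container in a Pod with restarting the Pod.
A Pod is not a process, but an environment for running one or more containers. A Pod persists until it is deleted. The kubelet can restart a container (after a crash, for example), but the Pod itself is never "restarted": once a Pod is deleted, any replacement is a brand-new Pod with a new UID, and typically a new name, IP address, and ephemeral volumes.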

The Pod struct is defined in the Kubernetes source as follows:

// https://github.com/kubernetes/kubernetes/blob/release-1.34/staging/src/k8s.io/api/core/v1/types.go

// Pod is a collection of containers that can run on a host. This resource is created
// by clients and scheduled onto hosts.
type Pod struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty" protobuf:"bytes,1,opt,name=metadata"`

	// Specification of the desired behavior of the pod.
	Spec PodSpec `json:"spec,omitempty" protobuf:"bytes,2,opt,name=spec"`
	Status PodStatus `json:"status,omitempty" protobuf:"bytes,3,opt,name=status"`
}

PodSpec is defined as follows:

// PodSpec is a description of a pod.
type PodSpec struct {
	// List of volumes that can be mounted by containers belonging to the pod.
	Volumes []Volume `json:"volumes,omitempty" patchStrategy:"merge,retainKeys" patchMergeKey:"name" protobuf:"bytes,1,rep,name=volumes"`
	// List of initialization containers belonging to the pod.
	// Init containers are executed in order prior to containers being started. 
	InitContainers []Container `json:"initContainers,omitempty" patchStrategy:"merge" patchMergeKey:"name" protobuf:"bytes,20,rep,name=initContainers"`
	// List of containers belonging to the pod.
	// Containers cannot currently be added or removed.
	// There must be at least one container in a Pod.
	Containers []Container `json:"containers" patchStrategy:"merge" patchMergeKey:"name" protobuf:"bytes,2,rep,name=containers"`
	// List of ephemeral containers run in this pod. Ephemeral containers may be run in an existing
	// pod to perform user-initiated actions such as debugging. This list cannot be specified when
	// creating a pod, and it cannot be modified by updating the pod spec. In order to add an
	// ephemeral container to an existing pod, use the pod's ephemeralcontainers subresource.
	EphemeralContainers []EphemeralContainer `json:"ephemeralContainers,omitempty" patchStrategy:"merge" patchMergeKey:"name" protobuf:"bytes,34,rep,name=ephemeralContainers"`
	// Restart policy for all containers within the pod.
	// One of Always, OnFailure, Never. In some contexts, only a subset of those values may be permitted.
	// Default to Always.
	RestartPolicy RestartPolicy `json:"restartPolicy,omitempty" protobuf:"bytes,3,opt,name=restartPolicy,casttype=RestartPolicy"`
	// Optional duration in seconds the pod needs to terminate gracefully. May be decreased in delete request.
	// Value must be non-negative integer. The value zero indicates stop immediately via
	// the kill signal (no opportunity to shut down).
	TerminationGracePeriodSeconds *int64 `json:"terminationGracePeriodSeconds,omitempty" protobuf:"varint,4,opt,name=terminationGracePeriodSeconds"`
	// Optional duration in seconds the pod may be active on the node relative to
	// StartTime before the system will actively try to mark it failed and kill associated containers.
	ActiveDeadlineSeconds *int64 `json:"activeDeadlineSeconds,omitempty" protobuf:"varint,5,opt,name=activeDeadlineSeconds"`
	// Set DNS policy for the pod.
	// Defaults to "ClusterFirst".
	DNSPolicy DNSPolicy `json:"dnsPolicy,omitempty" protobuf:"bytes,6,opt,name=dnsPolicy,casttype=DNSPolicy"`
	// NodeSelector is a selector which must be true for the pod to fit on a node.
	// Selector which must match a node's labels for the pod to be scheduled on that node.
	NodeSelector map[string]string `json:"nodeSelector,omitempty" protobuf:"bytes,7,rep,name=nodeSelector"`

	// ServiceAccountName is the name of the ServiceAccount to use to run this pod.
	ServiceAccountName string `json:"serviceAccountName,omitempty" protobuf:"bytes,8,opt,name=serviceAccountName"`
	// Deprecated: Use serviceAccountName instead.
	DeprecatedServiceAccount string `json:"serviceAccount,omitempty" protobuf:"bytes,9,opt,name=serviceAccount"`
	// AutomountServiceAccountToken indicates whether a service account token should be automatically mounted.
	AutomountServiceAccountToken *bool `json:"automountServiceAccountToken,omitempty" protobuf:"varint,21,opt,name=automountServiceAccountToken"`
	// NodeName indicates in which node this pod is scheduled.
	NodeName string `json:"nodeName,omitempty" protobuf:"bytes,10,opt,name=nodeName"`
	// Host networking requested for this pod. Use the host's network namespace.
	HostNetwork bool `json:"hostNetwork,omitempty" protobuf:"varint,11,opt,name=hostNetwork"`
	// Use the host's pid namespace.
	// Optional: Default to false.
	HostPID bool `json:"hostPID,omitempty" protobuf:"varint,12,opt,name=hostPID"`
	// Use the host's ipc namespace.
	// Optional: Default to false.
	HostIPC bool `json:"hostIPC,omitempty" protobuf:"varint,13,opt,name=hostIPC"`
	// Share a single process namespace between all of the containers in a pod.
	ShareProcessNamespace *bool `json:"shareProcessNamespace,omitempty" protobuf:"varint,27,opt,name=shareProcessNamespace"`
	// SecurityContext holds pod-level security attributes and common container settings.
	// Optional: Defaults to empty.  See type description for default values of each field.
	SecurityContext *PodSecurityContext `json:"securityContext,omitempty" protobuf:"bytes,14,opt,name=securityContext"`
	// ImagePullSecrets is an optional list of references to secrets in the same namespace to use for pulling any of the images used by this PodSpec.
	ImagePullSecrets []LocalObjectReference `json:"imagePullSecrets,omitempty" patchStrategy:"merge" patchMergeKey:"name" protobuf:"bytes,15,rep,name=imagePullSecrets"`
	// Specifies the hostname of the Pod
	Hostname string `json:"hostname,omitempty" protobuf:"bytes,16,opt,name=hostname"`
	// If specified, the fully qualified Pod hostname will be "<hostname>.<subdomain>.<pod namespace>.svc.<cluster domain>".
	Subdomain string `json:"subdomain,omitempty" protobuf:"bytes,17,opt,name=subdomain"`
	// If specified, the pod's scheduling constraints
	// +optional
	Affinity *Affinity `json:"affinity,omitempty" protobuf:"bytes,18,opt,name=affinity"`
	SchedulerName string `json:"schedulerName,omitempty" protobuf:"bytes,19,opt,name=schedulerName"`
	Tolerations []Toleration `json:"tolerations,omitempty" protobuf:"bytes,22,opt,name=tolerations"`
	// HostAliases is an optional list of hosts and IPs that will be injected into the pod's hosts
	// file if specified.
	HostAliases []HostAlias `json:"hostAliases,omitempty" patchStrategy:"merge" patchMergeKey:"ip" protobuf:"bytes,23,rep,name=hostAliases"`
	PriorityClassName string `json:"priorityClassName,omitempty" protobuf:"bytes,24,opt,name=priorityClassName"`
	Priority *int32 `json:"priority,omitempty" protobuf:"bytes,25,opt,name=priority"`
	DNSConfig *PodDNSConfig `json:"dnsConfig,omitempty" protobuf:"bytes,26,opt,name=dnsConfig"`
	// If specified, all readiness gates will be evaluated for pod readiness.
	ReadinessGates []PodReadinessGate `json:"readinessGates,omitempty" protobuf:"bytes,28,opt,name=readinessGates"`
	// RuntimeClassName refers to a RuntimeClass object in the node.k8s.io group, which should be used
	// to run this pod.  If no RuntimeClass resource matches the named class, the pod will not be run.
	RuntimeClassName *string `json:"runtimeClassName,omitempty" protobuf:"bytes,29,opt,name=runtimeClassName"`
	// EnableServiceLinks indicates whether information about services should be injected into pod's
	EnableServiceLinks *bool `json:"enableServiceLinks,omitempty" protobuf:"varint,30,opt,name=enableServiceLinks"`
	// PreemptionPolicy is the Policy for preempting pods with lower priority.
	PreemptionPolicy *PreemptionPolicy `json:"preemptionPolicy,omitempty" protobuf:"bytes,31,opt,name=preemptionPolicy"`
	Overhead ResourceList `json:"overhead,omitempty" protobuf:"bytes,32,opt,name=overhead"`
	TopologySpreadConstraints []TopologySpreadConstraint `json:"topologySpreadConstraints,omitempty" patchStrategy:"merge" patchMergeKey:"topologyKey" protobuf:"bytes,33,opt,name=topologySpreadConstraints"`
	// If true the pod's hostname will be configured as the pod's FQDN, rather than the leaf name (the default).
	SetHostnameAsFQDN *bool `json:"setHostnameAsFQDN,omitempty" protobuf:"varint,35,opt,name=setHostnameAsFQDN"`
	// Specifies the OS of the containers in the pod.
	// Some pod and container fields are restricted if this is set.
	OS *PodOS `json:"os,omitempty" protobuf:"bytes,36,opt,name=os"`
	HostUsers *bool `json:"hostUsers,omitempty" protobuf:"bytes,37,opt,name=hostUsers"`
	SchedulingGates []PodSchedulingGate `json:"schedulingGates,omitempty" patchStrategy:"merge" patchMergeKey:"name" protobuf:"bytes,38,opt,name=schedulingGates"`
	ResourceClaims []PodResourceClaim `json:"resourceClaims,omitempty" patchStrategy:"merge,retainKeys" patchMergeKey:"name" protobuf:"bytes,39,rep,name=resourceClaims"`
	Resources *ResourceRequirements `json:"resources,omitempty" protobuf:"bytes,40,opt,name=resources"`
	HostnameOverride *string `json:"hostnameOverride,omitempty" protobuf:"bytes,41,opt,name=hostnameOverride"`
	WorkloadRef *WorkloadReference `json:"workloadRef,omitempty" protobuf:"bytes,42,opt,name=workloadRef"`
}

A Pod can be created directly from a manifest like this one:

#https://k8s.io/examples/pods/simple-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx:1.14.2
    ports:
    - containerPort: 80

Then create the Pod with kubectl: kubectl apply -f https://k8s.io/examples/pods/simple-pod.yaml

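Pod Templates

In the previous section we created a single Pod directly by applying a manifest with kubectl apply (declarative object configuration). In practice, however, Pods are almost never created this way. Real workloads use workload resources (such as Deployment or StatefulSet) to create and manage multiple Pods. The resource's controller handles replication and rollout, and provides self-healing when Pods fail. For example, if a Node fails, the controller notices that the Pods on that Node have stopped working and creates replacement Pods. The scheduler places each replacement Pod onto a healthy Node.

Controllers for workload resources typically use a Pod template to create and manage Pods on your behalf. A Pod template is a specification for creating Pods that is embedded in a workload resource such as a Deployment, Job, or DaemonSet.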

// https://github.com/kubernetes/kubernetes/blob/release-1.34/staging/src/k8s.io/api/core/v1/types.go

// PodTemplateSpec describes the data a pod should have when created from a template
type PodTemplateSpec struct {
	metav1.ObjectMeta `json:"metadata,omitempty" protobuf:"bytes,1,opt,name=metadata"`

	// Specification of the desired behavior of the pod.
	Spec PodSpec `json:"spec,omitempty" protobuf:"bytes,2,opt,name=spec"`
}

// PodTemplate describes a template for creating copies of a predefined pod.
type PodTemplate struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty" protobuf:"bytes,1,opt,name=metadata"`

	// Template defines the pods that will be created from this pod template.
	Template PodTemplateSpec `json:"template,omitempty" protobuf:"bytes,2,opt,name=template"`
}

Take the Job workload resource as an example. Its JobSpec embeds a PodTemplateSpec in the Template field:

// https://github.com/kubernetes/kubernetes/blob/release-1.34/staging/src/k8s.io/api/batch/v1/types.go
// Job represents the configuration of a single job.
type Job struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty" protobuf:"bytes,1,opt,name=metadata"`

	// Specification of the desired behavior of a job.
	Spec JobSpec `json:"spec,omitempty" protobuf:"bytes,2,opt,name=spec"`
	// Current status of a job.
	Status JobStatus `json:"status,omitempty" protobuf:"bytes,3,opt,name=status"`
}

type JobSpec struct {
	// Specifies the maximum desired number of pods the job should
	// run at any given time. The actual number of pods running in steady state will
	// be less than this number when ((.spec.completions - .status.successful) < .spec.parallelism),
	// +optional
	Parallelism *int32 `json:"parallelism,omitempty" protobuf:"varint,1,opt,name=parallelism"`

	// Specifies the desired number of successfully finished pods the
	// job should be run with.  Setting to null means that the success of any
	// pod signals the success of all pods, and allows parallelism to have any positive
	// value.  Setting to 1 means that parallelism is limited to 1 and the success of that
	// pod signals the success of the job.
	Completions *int32 `json:"completions,omitempty" protobuf:"varint,2,opt,name=completions"`
	ActiveDeadlineSeconds *int64 `json:"activeDeadlineSeconds,omitempty" protobuf:"varint,3,opt,name=activeDeadlineSeconds"`
	PodFailurePolicy *PodFailurePolicy `json:"podFailurePolicy,omitempty" protobuf:"bytes,11,opt,name=podFailurePolicy"`
	SuccessPolicy *SuccessPolicy `json:"successPolicy,omitempty" protobuf:"bytes,16,opt,name=successPolicy"`
	BackoffLimit *int32 `json:"backoffLimit,omitempty" protobuf:"varint,7,opt,name=backoffLimit"`
	BackoffLimitPerIndex *int32 `json:"backoffLimitPerIndex,omitempty" protobuf:"varint,12,opt,name=backoffLimitPerIndex"`
	MaxFailedIndexes *int32 `json:"maxFailedIndexes,omitempty" protobuf:"varint,13,opt,name=maxFailedIndexes"`
	Selector *metav1.LabelSelector `json:"selector,omitempty" protobuf:"bytes,4,opt,name=selector"`
	ManualSelector *bool `json:"manualSelector,omitempty" protobuf:"varint,5,opt,name=manualSelector"`

	// Describes the pod that will be created when executing a job.
	// The only allowed template.spec.restartPolicy values are "Never" or "OnFailure".
	// More info: https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/
	Template corev1.PodTemplateSpec `json:"template" protobuf:"bytes,6,opt,name=template"`

	TTLSecondsAfterFinished *int32 `json:"ttlSecondsAfterFinished,omitempty" protobuf:"varint,8,opt,name=ttlSecondsAfterFinished"`
	CompletionMode *CompletionMode `json:"completionMode,omitempty" protobuf:"bytes,9,opt,name=completionMode,casttype=CompletionMode"`
	Suspend *bool `json:"suspend,omitempty" protobuf:"varint,10,opt,name=suspend"`
	PodReplacementPolicy *PodReplacementPolicy `json:"podReplacementPolicy,omitempty" protobuf:"bytes,14,opt,name=podReplacementPolicy,casttype=podReplacementPolicy"`
	ManagedBy *string `json:"managedBy,omitempty" protobuf:"bytes,15,opt,name=managedBy"`
}
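
The comment on Parallelism above says that fewer Pods than the requested parallelism run once only a few completions remain. The Go sketch below restates that rule as plain arithmetic. `wantActive` is a hypothetical helper, not the Job controller's code, and it assumes completions is set and ignores failures, back-off, and suspension.

```go
package main

import "fmt"

// wantActive is a hypothetical helper: it returns how many Pods a Job with a
// fixed completion count would keep running at once. The result is the
// smaller of parallelism and the number of completions still outstanding.
func wantActive(parallelism, completions, succeeded int32) int32 {
	remaining := completions - succeeded
	if remaining < 0 {
		remaining = 0
	}
	if remaining < parallelism {
		return remaining
	}
	return parallelism
}

func main() {
	fmt.Println(wantActive(3, 5, 0)) // prints: 3
	fmt.Println(wantActive(3, 5, 4)) // prints: 1
}
```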

The following manifest creates a Pod through the Job workload resource:

apiVersion: batch/v1
kind: Job
metadata:
  name: hello
spec:
  template:
    # The Pod template starts here
    spec:
      containers:
      - name: hello
        image: nginx
        command: ['sh', '-c', 'echo "Hello, Kubernetes!" && sleep 3600']
      restartPolicy: OnFailure
    # The Pod template ends here

Applying the manifest lets the Job create the Pod:

$ kubectl apply -f test.yaml 
job.batch/hello created

$ kubectl get all
NAME                                       READY   STATUS    RESTARTS   AGE
pod/hello-gjszv                            1/1     Running   0          33s

NAME              STATUS    COMPLETIONS   DURATION   AGE
job.batch/hello   Running   0/1           35s        35s

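When the Pod template of a workload resource changes, the controller creates new Pods based on the updated template instead of updating or patching the existing Pods. Kubernetes doesn't prevent you from managing Pods directly, and it is possible to update some fields of a running Pod in place. However, update operations such as patch and replace have some limitations:

Most of the metadata about a Pod is immutable. For example, you cannot change the namespace, name, uid, or creationTimestamp fields.
If metadata.deletionTimestamp is set, no new entry can be added to the metadata.finalizers list.
Pod updates may not change fields other than spec.containers[*].image, spec.initContainers[*].image, spec.activeDeadlineSeconds, spec.terminationGracePeriodSeconds, spec.tolerations or spec.schedulingGates. For spec.tolerations, you can only add new entries.
When updating the spec.activeDeadlineSeconds field, two types of updates are allowed:
1. if the field is unset, it can be set to a positive number;
2. if the field is already a positive number, it can be set to a smaller, non-negative number.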
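
The two rules for spec.activeDeadlineSeconds can be written down as a small check. The Go sketch below is only a model of the rules as stated above: `validateActiveDeadlineUpdate` and `ptr` are hypothetical names, and the real API server validation may differ in detail.

```go
package main

import (
	"errors"
	"fmt"
)

// ptr returns a pointer to v; nil pointers stand for "field unset".
func ptr(v int64) *int64 { return &v }

// validateActiveDeadlineUpdate is a hypothetical helper that models the two
// allowed updates: unset -> positive, and positive -> smaller non-negative.
func validateActiveDeadlineUpdate(oldVal, newVal *int64) error {
	switch {
	case oldVal == nil && newVal == nil:
		return nil // field untouched
	case oldVal == nil:
		if *newVal <= 0 {
			return errors.New("an unset deadline may only be set to a positive number")
		}
		return nil
	case newVal == nil:
		return errors.New("a deadline cannot be unset once it has been set")
	case *newVal < 0 || *newVal > *oldVal:
		return errors.New("a deadline may only be lowered to a non-negative number")
	}
	return nil
}

func main() {
	fmt.Println(validateActiveDeadlineUpdate(nil, ptr(600)))      // allowed
	fmt.Println(validateActiveDeadlineUpdate(ptr(600), ptr(300))) // allowed
	fmt.Println(validateActiveDeadlineUpdate(ptr(300), ptr(600))) // rejected
}
```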

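The restrictions above apply to ordinary in-place updates. For some advanced scenarios, Kubernetes provides Pod subresources that can update fields an ordinary update may not change:

Resize: the resize subresource allows container resources (spec.containers[*].resources) to be updated. See Resize Container Resources for more details.
Ephemeral containers: the ephemeralContainers subresource allows ephemeral containers to be added to a Pod. See Ephemeral Containers for more details.
Status: the status subresource allows the Pod status to be updated. This is typically only used by the kubelet and other system controllers.
Binding: the binding subresource allows setting the Pod's spec.nodeName via a Binding request. This is typically only used by the scheduler.

The following example uses the resize subresource to adjust a Pod's CPU without restarting its container: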

apiVersion: v1
kind: Pod
metadata:
  name: resize-demo
spec:
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.8
    resizePolicy:
    - resourceName: cpu
      restartPolicy: NotRequired # Default, but explicit here
    - resourceName: memory
      restartPolicy: RestartContainer
    resources:
      limits:
        memory: "200Mi"
        cpu: "700m"
      requests:
        memory: "200Mi"
        cpu: "700m"

Increase the CPU request and limit to 800m by using kubectl patch with the --subresource resize command-line argument:

$ kubectl patch pod resize-demo --subresource resize --patch \
  '{"spec":{"containers":[{"name":"pause", "resources":{"requests":{"cpu":"800m"}, "limits":{"cpu":"800m"}}}]}}'

Static Pods

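Static Pods are Pods managed directly by the kubelet daemon on a specific node. Their characteristics are as follows:

In Kubernetes, the vast majority of Pods are created and managed through the control plane (by controllers such as Deployment and StatefulSet); their lifecycle is managed jointly by the API server and those controllers.
Static Pods are different: they are not managed by the API server, but are created and supervised directly by the kubelet on the node.

If a static Pod crashes or exits, the kubelet restarts it automatically, so it self-heals just like an ordinary Pod, but this happens entirely on the node and does not depend on the control plane. Each static Pod is bound to the one node on which it is defined and cannot be scheduled elsewhere. Because static Pods are not created through the API server, the scheduler plays no part in placing them.

So why do static Pods exist? Mainly to run a self-hosted control plane.

The most common use of static Pods is to run the control plane components themselves (kube-apiserver, kube-controller-manager, kube-scheduler and so on) inside the Kubernetes cluster.
This is known as a "self-hosted control plane": the kubelet supervises the lifecycle of the control plane components.

Although the API server does not manage a static Pod, the kubelet automatically creates a corresponding mirror Pod on the API server. The mirror Pod exists only for visibility (so the Pod shows up in kubectl get pods) and cannot be controlled from there: running kubectl delete pod, for instance, does not actually remove the static Pod. A static Pod can only be removed by deleting its manifest file on the node.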

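Static Pods have one important limitation:

The spec of a static Pod cannot refer to other API objects

That is, a static Pod's YAML/JSON manifest cannot refer to other API objects in the cluster, such as a ServiceAccount, ConfigMap, Secret, or PersistentVolumeClaim.

Reason: the kubelet starts a static Pod directly from a manifest file it reads locally (usually from the /etc/kubernetes/manifests/ directory). At that point the API server may not be running yet, or the kubelet may be unable to resolve such references.
A static Pod's configuration must therefore be fully self-contained. Secret material, for example, has to be written directly into the Pod spec as environment variables or volumes (not recommended, because it is a security risk), or supplied as configuration files mounted from a host path.

On a Kubernetes control plane (master) node, you will find several YAML files in the /etc/kubernetes/manifests/ directory, for example: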

$ ls /etc/kubernetes/manifests/
etcd.yaml  kube-apiserver.yaml  kube-controller-manager.yaml  kube-scheduler.yaml

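These are the static Pod manifests. The kubelet reads them automatically and starts the corresponding containers, even before the Kubernetes control plane is fully up.

When you modify a static Pod manifest such as kube-apiserver.yaml, the kubelet detects the change automatically and updates or restarts the corresponding static Pod. When the change requires the Pod to be rebuilt, the kubelet deletes the old Pod first and then creates the new one (a recreate strategy).

You can also place your own static Pod manifest in a node's /etc/kubernetes/manifests/ directory; the kubelet discovers it automatically and creates the Pod: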

# /etc/kubernetes/manifests/test.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx:1.14.2
    ports:
    - containerPort: 80

The automatically created Pod then shows up in the Pod list:

$ kubectl get pods
NAME                 READY   STATUS    RESTARTS   AGE
nginx-controlplane   1/1     Running   0          2m46s

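Pod Lifecycle

Pods are ephemeral:

A Pod is not a "permanent" resource. It is designed as a replaceable, disposable unit.
Every Pod has a unique UID. Once a Pod is deleted, a Pod recreated with the same name is a brand-new entity with a different UID.
If a node dies, the Pods on it are scheduled for deletion after a timeout period because they cannot be recovered, and a higher-level controller (such as a Deployment) is responsible for creating new Pods.

✅ This reflects Kubernetes' "declarative + self-healing" design philosophy: rather than repairing an old Pod, create a new one.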

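The Pod lifecycle is managed jointly by the kubelet and the control plane:

The kubelet restarts containers locally on the node (for example, a crashed container is restarted according to restartPolicy).
The control plane (API server + controller manager) cleans up Pod records when a node fails.
A Pod's health is reflected by its container states (Waiting / Running / Terminated) and Pod conditions (such as Ready and Initialized).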
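
As a rough illustration of the restartPolicy decision mentioned in the list above, the Go sketch below maps the three policy values to a yes/no restart decision. The `shouldRestart` function and the local `restartPolicy` type are hypothetical; the sketch ignores the kubelet's back-off delays and any per-container restart rules.

```go
package main

import "fmt"

// restartPolicy mirrors the three values a Pod's spec.restartPolicy accepts.
type restartPolicy string

const (
	always    restartPolicy = "Always"
	onFailure restartPolicy = "OnFailure"
	never     restartPolicy = "Never"
)

// shouldRestart reports whether a container that exited with exitCode would
// be restarted under the given policy.
func shouldRestart(p restartPolicy, exitCode int) bool {
	switch p {
	case always:
		return true
	case onFailure:
		return exitCode != 0
	default:
		// never, or a value this sketch does not know about
		return false
	}
}

func main() {
	fmt.Println(shouldRestart(always, 0))    // prints: true
	fmt.Println(shouldRestart(onFailure, 0)) // prints: false
	fmt.Println(shouldRestart(onFailure, 1)) // prints: true
	fmt.Println(shouldRestart(never, 1))     // prints: false
}
```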

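A Pod is scheduled only once:

Once the scheduler has assigned a Pod to a node, the Pod stays bound to that node until it stops or is terminated.
Even if the node later runs short of resources or loses network connectivity, Kubernetes does not migrate the Pod to another node.
If the node dies before the Pod starts, that Pod never runs; it can only be replaced by a new Pod created by a higher-level controller (such as a ReplicaSet).

⚠️ This is why you should not create bare Pods directly: a bare Pod has no self-healing. Use a controller such as a Deployment or StatefulSet instead.

Each state in the Pod lifecycle is recorded in the object's Status struct, which for a Pod is PodStatus. PodStatus is defined as follows: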

// https://github.com/kubernetes/kubernetes/blob/release-1.34/staging/src/k8s.io/api/core/v1/types.go

type PodStatus struct {
	ObservedGeneration int64 `json:"observedGeneration,omitempty" protobuf:"varint,17,opt,name=observedGeneration"`
	// The phase of a Pod is a simple, high-level summary of where the Pod is in its lifecycle.
	Phase PodPhase `json:"phase,omitempty" protobuf:"bytes,1,opt,name=phase,casttype=PodPhase"`
	// Current service state of pod.
	Conditions []PodCondition `json:"conditions,omitempty" patchStrategy:"merge" patchMergeKey:"type" protobuf:"bytes,2,rep,name=conditions"`
	// A human readable message indicating details about why the pod is in this condition.
	Message string `json:"message,omitempty" protobuf:"bytes,3,opt,name=message"`
	// A brief CamelCase message indicating details about why the pod is in this state.
	Reason string `json:"reason,omitempty" protobuf:"bytes,4,opt,name=reason"`
	
	NominatedNodeName string `json:"nominatedNodeName,omitempty" protobuf:"bytes,11,opt,name=nominatedNodeName"`

	// hostIP holds the IP address of the host to which the pod is assigned. Empty if the pod has not started yet.
	HostIP string `json:"hostIP,omitempty" protobuf:"bytes,5,opt,name=hostIP"`
	HostIPs []HostIP `json:"hostIPs,omitempty" protobuf:"bytes,16,rep,name=hostIPs" patchStrategy:"merge" patchMergeKey:"ip"`

	// podIP address allocated to the pod. Routable at least within the cluster.
	PodIP string `json:"podIP,omitempty" protobuf:"bytes,6,opt,name=podIP"`
	PodIPs []PodIP `json:"podIPs,omitempty" protobuf:"bytes,12,rep,name=podIPs" patchStrategy:"merge" patchMergeKey:"ip"`

	StartTime *metav1.Time `json:"startTime,omitempty" protobuf:"bytes,7,opt,name=startTime"`

	// Statuses of init containers in this pod.
	InitContainerStatuses []ContainerStatus `json:"initContainerStatuses,omitempty" protobuf:"bytes,10,rep,name=initContainerStatuses"`

	// Statuses of containers in this pod.
	ContainerStatuses []ContainerStatus `json:"containerStatuses,omitempty" protobuf:"bytes,8,rep,name=containerStatuses"`

	QOSClass PodQOSClass `json:"qosClass,omitempty" protobuf:"bytes,9,rep,name=qosClass"`

	// Statuses for any ephemeral containers that have run in this pod.
	EphemeralContainerStatuses []ContainerStatus `json:"ephemeralContainerStatuses,omitempty" protobuf:"bytes,13,rep,name=ephemeralContainerStatuses"`

	// Status of resources resize desired for pod's containers.
	Resize PodResizeStatus `json:"resize,omitempty" protobuf:"bytes,14,opt,name=resize,casttype=PodResizeStatus"`

	// Status of resource claims.
	ResourceClaimStatuses []PodResourceClaimStatus `json:"resourceClaimStatuses,omitempty" patchStrategy:"merge,retainKeys" patchMergeKey:"name" protobuf:"bytes,15,rep,name=resourceClaimStatuses"`

	ExtendedResourceClaimStatus *PodExtendedResourceClaimStatus `json:"extendedResourceClaimStatus,omitempty" protobuf:"bytes,18,opt,name=extendedResourceClaimStatus"`
}

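Pod Phase

Let's look at the lifecycle phases recorded in PodStatus.

A Pod's PodStatus contains a phase field. The phase is a simple, high-level summary of where the Pod is in its lifecycle; it is not intended to be a comprehensive rollup of observations of container or Pod state, nor is it intended to be a comprehensive state machine.

The possible values of phase and their meanings are:

Value	Description
`Pending`	The Pod has been accepted by the Kubernetes cluster, but one or more of its containers has not yet been created or started. This includes the time a Pod spends waiting to be scheduled as well as the time spent downloading container images over the network.
`Running`	The Pod has been bound to a node, and all of its containers have been created. At least one container is still running, or is in the process of starting or restarting.
`Succeeded`	All containers in the Pod have terminated successfully and will not be restarted.
`Failed`	All containers in the Pod have terminated, and at least one container terminated in failure. That is, the container either exited with a non-zero status or was terminated by the system.
`Unknown`	For some reason the state of the Pod could not be obtained. This typically happens because of an error communicating with the node where the Pod should be running.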
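
One way to read the table is as a function from container states to a phase. The Go sketch below encodes that reading. It is an interpretation of the table only, not the kubelet's implementation: the `state` type, its values and `podPhase` are hypothetical, and an empty slice stands in for "the state could not be obtained".

```go
package main

import "fmt"

// state is a hypothetical, coarse view of one container.
type state int

const (
	notCreated   state = iota // not scheduled yet, image still being pulled, ...
	running                   // currently executing
	restarting                // crashed and waiting to be started again
	exitedOK                  // terminated with exit code 0, will not restart
	exitedFailed              // terminated in failure, will not restart
)

// podPhase derives a phase from container states following the table above.
func podPhase(containers []state) string {
	if len(containers) == 0 {
		return "Unknown" // no status could be obtained
	}
	active, failed := false, false
	for _, s := range containers {
		switch s {
		case notCreated:
			return "Pending" // at least one container has not been created
		case running, restarting:
			active = true
		case exitedFailed:
			failed = true
		}
	}
	switch {
	case active:
		return "Running"
	case failed:
		return "Failed"
	default:
		return "Succeeded"
	}
}

func main() {
	fmt.Println(podPhase([]state{running, notCreated}))    // prints: Pending
	fmt.Println(podPhase([]state{running, exitedOK}))      // prints: Running
	fmt.Println(podPhase([]state{exitedOK, exitedFailed})) // prints: Failed
	fmt.Println(podPhase([]state{exitedOK}))               // prints: Succeeded
}
```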

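When a Pod is being deleted, some kubectl commands show its status as Terminating. This Terminating status is not one of the Pod phases. A Pod is granted a term to terminate gracefully, which defaults to 30 seconds. You can use the --force flag to terminate a Pod by force.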

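Pod Conditions

A Pod's PodStatus contains a Conditions []PodCondition array: a set of conditions that describe the Pod's current state. Each condition reports whether the Pod meets a particular requirement, which helps users and the system tell whether the Pod is running normally, is ready to receive traffic, has been scheduled, and so on. A Pod may or may not have passed each of these condition checks.

The PodCondition struct is defined as follows: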

// https://github.com/kubernetes/kubernetes/blob/release-1.34/staging/src/k8s.io/api/core/v1/types.go

// PodCondition contains details for the current condition of this pod.
type PodCondition struct {
	// Type is the type of the condition.
	Type PodConditionType `json:"type" protobuf:"bytes,1,opt,name=type,casttype=PodConditionType"`
	ObservedGeneration int64 `json:"observedGeneration,omitempty" protobuf:"varint,7,opt,name=observedGeneration"`
	// Status is the status of the condition.
	// Can be True, False, Unknown.
	Status ConditionStatus `json:"status" protobuf:"bytes,2,opt,name=status,casttype=ConditionStatus"`
	// Last time we probed the condition.
	LastProbeTime metav1.Time `json:"lastProbeTime,omitempty" protobuf:"bytes,3,opt,name=lastProbeTime"`
	// Last time the condition transitioned from one status to another.
	LastTransitionTime metav1.Time `json:"lastTransitionTime,omitempty" protobuf:"bytes,4,opt,name=lastTransitionTime"`
	// Unique, one-word, CamelCase reason for the condition's last transition.
	Reason string `json:"reason,omitempty" protobuf:"bytes,5,opt,name=reason"`
	// Human-readable message indicating details about last transition.
	Message string `json:"message,omitempty" protobuf:"bytes,6,opt,name=message"`
}

type PodConditionType string

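The fields of PodCondition have the following meanings:

Field name	Description
`type`	Name of this Pod condition.
`status`	Indicates whether that condition is applicable, with possible values "`True`", "`False`", or "`Unknown`".
`lastProbeTime`	Timestamp of when the Pod condition was last probed.
`lastTransitionTime`	Timestamp for when the Pod last transitioned from one status to another.
`reason`	Machine-readable, UpperCamelCase text indicating the reason for the condition's last transition.
`message`	Human-readable message indicating details about the last status transition.

PodConditionType (a string type) has the following values:

PodScheduled: the Pod has been scheduled to a node.
PodReadyToStartContainers: the Pod sandbox has been successfully created and networking configured (beta feature, enabled by default).
ContainersReady: all containers in the Pod are ready.
Initialized: all init containers have completed successfully.
Ready: the Pod is able to serve requests and should be added to the load balancing pools of all matching Services.
DisruptionTarget: the Pod is about to be terminated due to a disruption (such as preemption, eviction or garbage collection).
PodResizePending: a Pod resize was requested but cannot be applied yet. See Pod resize status.
PodResizeInProgress: the Pod is in the process of resizing. See Pod resize status.

The following shows the status.conditions of a Pod: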

# kubectl get pod wecom-read-it-later-7c58678d5b-r6n9w -o yaml
apiVersion: v1
kind: Pod
metadata:
	...
spec:
  containers:
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2025-11-11T16:04:40Z"
    status: "True"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2025-11-11T16:04:29Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2025-11-11T16:04:40Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2025-11-11T16:04:40Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2025-11-11T16:04:29Z"
    status: "True"
    type: PodScheduled
  ...

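Pod Readiness

As described above, a Pod's PodStatus contains a Conditions array describing the Pod's current state. You can set readinessGates in the Pod's spec to list additional condition types; the Pod is considered ready only when the matching entries in its Conditions are True.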

// https://github.com/kubernetes/kubernetes/blob/release-1.34/staging/src/k8s.io/api/core/v1/types.go

type PodSpec struct {
  ...
  // If specified, all readiness gates will be evaluated for pod readiness.
	// A pod is ready when all its containers are ready AND
	// all conditions specified in the readiness gates have status equal to "True"
	ReadinessGates []PodReadinessGate `json:"readinessGates,omitempty" protobuf:"bytes,28,opt,name=readinessGates"`
  ...
}

type PodReadinessGate struct {
	// ConditionType refers to a condition in the pod's condition list with matching type.
	ConditionType PodConditionType `json:"conditionType" protobuf:"bytes,1,opt,name=conditionType,casttype=PodConditionType"`
}

type PodConditionType string

Readiness gates are determined by the current state of the status.conditions fields for the Pod. If Kubernetes cannot find a condition type listed in readinessGates in the Pod's status.conditions field, the status of that condition defaults to "False".

kind: Pod
...
spec:
  readinessGates:
    - conditionType: "www.example.com/feature-1"
status:
  conditions:
    - type: Ready                              # a built-in PodCondition
      status: "False"
      lastProbeTime: null
      lastTransitionTime: 2018-01-01T00:00:00Z
    - type: "www.example.com/feature-1"        # an extra PodCondition
      status: "False"
      lastProbeTime: null
      lastTransitionTime: 2018-01-01T00:00:00Z
  containerStatuses:
    - containerID: docker://abcd...
      ready: true
...

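The kubectl patch command does not support patching object status. To set these status.conditions for a Pod, applications and operators should use the PATCH action to set custom conditions for Pod readiness.

For a Pod that uses custom readiness conditions, the Pod is evaluated as Ready only when both of the following statements apply:

All containers in the Pod are ready.
All conditions specified in readinessGates are True.

When a Pod's containers are ready but at least one custom condition is missing or False, the kubelet marks the Pod's ContainersReady condition as True but does not mark the Pod as Ready overall.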
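
The readiness rule above fits in a few lines of Go. This is a sketch of the rule as described, not kubelet code: the `condition` type and `podReady` function are hypothetical, and the condition name is the placeholder taken from the example manifest.

```go
package main

import "fmt"

// condition is a simplified stand-in for an entry in status.conditions.
type condition struct {
	condType string
	status   string // "True", "False" or "Unknown"
}

// podReady applies the rule described above: every container must be ready
// and every readiness gate must have a matching condition with status "True".
// A gate with no matching condition counts as "False".
func podReady(containersReady bool, gates []string, conds []condition) bool {
	if !containersReady {
		return false
	}
	status := make(map[string]string, len(conds))
	for _, c := range conds {
		status[c.condType] = c.status
	}
	for _, g := range gates {
		if status[g] != "True" { // a missing key yields "", treated as False
			return false
		}
	}
	return true
}

func main() {
	gates := []string{"www.example.com/feature-1"}

	// Containers are ready, but the custom condition is still False.
	fmt.Println(podReady(true, gates, []condition{{"www.example.com/feature-1", "False"}})) // prints: false

	// Once a controller patches the condition to True, the Pod becomes Ready.
	fmt.Println(podReady(true, gates, []condition{{"www.example.com/feature-1", "True"}})) // prints: true
}
```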

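For a Pod with init containers, the kubelet sets the Initialized condition to True after the init containers have successfully completed (which happens after the runtime has successfully created the sandbox and configured networking). For a Pod without init containers, the kubelet sets the Initialized condition to True before sandbox creation and network configuration start.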

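Container States

As well as the phase of the Pod overall, Kubernetes tracks the state of each container inside a Pod. You can use container lifecycle hooks to trigger events to run at certain points in a container's lifecycle. The ContainerStatus struct that PodStatus holds for each container is defined as follows: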

// https://github.com/kubernetes/kubernetes/blob/release-1.34/staging/src/k8s.io/api/core/v1/types.go

type PodStatus struct {
  ...
	// Statuses of init containers in this pod.
	InitContainerStatuses []ContainerStatus `json:"initContainerStatuses,omitempty" protobuf:"bytes,10,rep,name=initContainerStatuses"`

	// Statuses of containers in this pod.
	ContainerStatuses []ContainerStatus `json:"containerStatuses,omitempty" protobuf:"bytes,8,rep,name=containerStatuses"`
  ...
}

// ContainerStatus contains details for the current status of this container.
type ContainerStatus struct {
	Name string `json:"name" protobuf:"bytes,1,opt,name=name"`
	// State holds details about the container's current condition.
	State ContainerState `json:"state,omitempty" protobuf:"bytes,2,opt,name=state"`
	LastTerminationState ContainerState `json:"lastState,omitempty" protobuf:"bytes,3,opt,name=lastState"`
	// Ready specifies whether the container is currently passing its readiness check.
	// The value will change as readiness probes keep executing. If no readiness
	// probes are specified, this field defaults to true once the container is
	// fully started (see Started field).
	//
	// The value is typically used to determine whether a container is ready to
	// accept traffic.
	Ready bool `json:"ready" protobuf:"varint,4,opt,name=ready"`
	// RestartCount holds the number of times the container has been restarted.
	RestartCount int32 `json:"restartCount" protobuf:"varint,5,opt,name=restartCount"`
	// Image is the name of container image that the container is running.
	Image string `json:"image" protobuf:"bytes,6,opt,name=image"`
	// ImageID is the image ID of the container's image.
	ImageID string `json:"imageID" protobuf:"bytes,7,opt,name=imageID"`
	// ContainerID is the ID of the container in the format '<type>://<container_id>'.
	ContainerID string `json:"containerID,omitempty" protobuf:"bytes,8,opt,name=containerID"`
	// Started indicates whether the container has finished its postStart lifecycle hook
	// and passed its startup probe.
	Started *bool `json:"started,omitempty" protobuf:"varint,9,opt,name=started"`
	AllocatedResources ResourceList `json:"allocatedResources,omitempty" protobuf:"bytes,10,rep,name=allocatedResources,casttype=ResourceList,castkey=ResourceName"`
	Resources *ResourceRequirements `json:"resources,omitempty" protobuf:"bytes,11,opt,name=resources"`
	// Status of volume mounts.
	VolumeMounts []VolumeMountStatus `json:"volumeMounts,omitempty" patchStrategy:"merge" patchMergeKey:"mountPath" protobuf:"bytes,12,rep,name=volumeMounts"`
	// User represents user identity information initially attached to the first process of the container
	User *ContainerUser `json:"user,omitempty" protobuf:"bytes,13,opt,name=user,casttype=ContainerUser"`
	// AllocatedResourcesStatus represents the status of various resources
	// allocated for this Pod.
	AllocatedResourcesStatus []ResourceStatus `json:"allocatedResourcesStatus,omitempty" patchStrategy:"merge" patchMergeKey:"name" protobuf:"bytes,14,rep,name=allocatedResourcesStatus"`
	// StopSignal reports the effective stop signal for this container
	StopSignal *Signal `json:"stopSignal,omitempty" protobuf:"bytes,15,opt,name=stopSignal"`
}

// ContainerState holds a possible state of container.
// Only one of its members may be specified.
// If none of them is specified, the default one is ContainerStateWaiting.
type ContainerState struct {
	// Details about a waiting container
	Waiting *ContainerStateWaiting `json:"waiting,omitempty" protobuf:"bytes,1,opt,name=waiting"`
	// Details about a running container
	Running *ContainerStateRunning `json:"running,omitempty" protobuf:"bytes,2,opt,name=running"`
	// Details about a terminated container
	Terminated *ContainerStateTerminated `json:"terminated,omitempty" protobuf:"bytes,3,opt,name=terminated"`
}

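Once the scheduler assigns a Pod to a Node, the kubelet starts creating that Pod's containers through the container runtime. A container can be in one of three states:

Waiting

A container in the Waiting state is still running the operations it needs to complete startup, such as pulling the image or applying Secret data. When you query such a container with kubectl, the Reason field summarizes why it is waiting.

Running

A container in the Running state is executing without issues. If a postStart hook is configured, it has already run and finished.

Terminated

A container in the Terminated state began execution and then either ran to completion or failed for some reason. kubectl shows the reason, the exit code, and the start and finish times of the container's execution. If a preStop hook is configured, it runs before the container enters the Terminated state.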
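
The "only one member may be set, Waiting is the default" rule from the ContainerState comment can be sketched in a few lines. This is a minimal illustration with simplified stand-in types (Waiting, Running, Terminated and stateName are hypothetical names, not the real k8s.io/api types):

```go
package main

import "fmt"

// Simplified stand-ins for the ContainerState* types shown above (assumption).
type Waiting struct{ Reason string }
type Running struct{}
type Terminated struct{ ExitCode int32 }

type ContainerState struct {
	Waiting    *Waiting
	Running    *Running
	Terminated *Terminated
}

// stateName reports which member is set. Following the comment on
// ContainerState, a value with no member set is treated as Waiting.
func stateName(s ContainerState) string {
	switch {
	case s.Running != nil:
		return "Running"
	case s.Terminated != nil:
		return "Terminated"
	default:
		return "Waiting"
	}
}

func main() {
	fmt.Println(stateName(ContainerState{}))
	fmt.Println(stateName(ContainerState{Waiting: &Waiting{Reason: "ContainerCreating"}}))
	fmt.Println(stateName(ContainerState{Running: &Running{}}))
	fmt.Println(stateName(ContainerState{Terminated: &Terminated{ExitCode: 1}}))
}
```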

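In practice, nearly all state changes during a Pod's lifecycle are maintained by the kubelet. For example, the kubelet periodically runs the probes of every container and updates the container status based on the results.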

// https://github.com/kubernetes/kubernetes/blob/release-1.34/pkg/kubelet/kubelet_pods.go

// generateAPIPodStatus creates the final API pod status for a pod, given the
// internal pod status. This method should only be called from within sync*Pod methods.
func (kl *Kubelet) generateAPIPodStatus(pod *v1.Pod, podStatus *kubecontainer.PodStatus, podIsTerminal bool) v1.PodStatus {
  ...
  allContainerStatuses := append(s.InitContainerStatuses, s.ContainerStatuses...)
	s.Conditions = append(s.Conditions, status.GeneratePodInitializedCondition(pod, &oldPodStatus, allContainerStatuses, s.Phase))
	s.Conditions = append(s.Conditions, status.GeneratePodReadyCondition(pod, &oldPodStatus, s.Conditions, allContainerStatuses, s.Phase))
	s.Conditions = append(s.Conditions, status.GenerateContainersReadyCondition(pod, &oldPodStatus, allContainerStatuses, s.Phase))
	s.Conditions = append(s.Conditions, v1.PodCondition{
		Type:               v1.PodScheduled,
		ObservedGeneration: podutil.CalculatePodConditionObservedGeneration(&oldPodStatus, pod.Generation, v1.PodScheduled),
		Status:             v1.ConditionTrue,
	})
  ...
}

For example, the Status of a Pod's Ready condition is computed as shown below. A Pod is Ready only when:

All containers are ready, that is, the Status of the ContainersReady condition is True;
Every condition type listed in ReadinessGates has Status True.

// https://github.com/kubernetes/kubernetes/blob/release-1.34/pkg/kubelet/status/generate.go

// GeneratePodReadyCondition returns "Ready" condition of a pod.
// The status of "Ready" condition is "True", if all containers in a pod are ready
// AND all matching conditions specified in the ReadinessGates have status equal to "True".
func GeneratePodReadyCondition(pod *v1.Pod, oldPodStatus *v1.PodStatus, conditions []v1.PodCondition, containerStatuses []v1.ContainerStatus, podPhase v1.PodPhase) v1.PodCondition {
	containersReady := GenerateContainersReadyCondition(pod, oldPodStatus, containerStatuses, podPhase)
	// If the status of ContainersReady is not True, return the same status, reason and message as ContainersReady.
	if containersReady.Status != v1.ConditionTrue {
		return v1.PodCondition{
			Type:               v1.PodReady,
			ObservedGeneration: podutil.CalculatePodConditionObservedGeneration(oldPodStatus, pod.Generation, v1.PodReady),
			Status:             containersReady.Status,
			Reason:             containersReady.Reason,
			Message:            containersReady.Message,
		}
	}

	// Evaluate corresponding conditions specified in readiness gate
	// Generate message if any readiness gate is not satisfied.
	unreadyMessages := []string{}
	for _, rg := range pod.Spec.ReadinessGates {
		...
	}
	...
	return v1.PodCondition{
		Type:               v1.PodReady,
		ObservedGeneration: podutil.CalculatePodConditionObservedGeneration(oldPodStatus, pod.Generation, v1.PodReady),
		Status:             v1.ConditionTrue,
	}
}

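How Pods handle container problems

Kubernetes manages container failures within Pods using the restartPolicy defined in the Pod spec. This policy determines how Kubernetes reacts when a container exits because of an error or for other reasons. The sequence is as follows:

Initial crash: Kubernetes attempts an immediate restart based on the Pod's restartPolicy.
Repeated crashes: after the initial crash, Kubernetes applies an exponential backoff delay to subsequent restarts, as described under restartPolicy. This prevents rapid, repeated restart attempts from overloading the system.
CrashLoopBackOff state: this indicates that the backoff delay is currently in effect for a container that is in a crash loop, failing and restarting repeatedly.
Backoff reset: if a container runs successfully for a certain duration (for example, 10 minutes), Kubernetes resets the backoff delay and treats the next crash as the first one.

In practice, CrashLoopBackOff is a condition or event that shows up in the output of kubectl get pods or kubectl describe pod when a container fails to start properly and keeps retrying in a loop.

In other words, once a container enters a crash loop, Kubernetes applies the exponential backoff delay from the container restart policy so that a faulty container cannot exhaust system resources by restarting continuously.

Common causes of CrashLoopBackOff include:

Application errors that cause the container to exit;
Configuration errors, such as incorrect environment variables or missing configuration files;
Resource constraints, where the container does not have enough memory or CPU to start properly;
Failed health checks, where the application does not start serving within the expected time;
A failing liveness probe or startup probe (see the probes section).

To investigate the root cause of a CrashLoopBackOff, you can:

Check logs: use kubectl logs <pod-name> to view the container logs. This is often the most direct way to diagnose a crash.
Inspect events: use kubectl describe pod <pod-name> to view the Pod's events, which can reveal configuration or resource issues.
Review configuration: make sure the Pod configuration (environment variables, mounted volumes) is correct and that every external resource it depends on (Secrets, ConfigMaps, file paths) is available.
Check resource limits: confirm that the container has enough CPU and memory. Sometimes raising the resource requests or limits in the Pod definition is enough to fix the problem.
Debug the application: the application code may contain bugs or misconfigurations. Running the same container image locally or in a development environment helps pin down application-level problems.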

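Container restart policy

The Pod spec has a restartPolicy field that defines how the containers in the Pod are restarted. Possible values are:

Always: automatically restarts the container after any termination. This is the default;
OnFailure: restarts the container only if it exits with an error (non-zero exit status);
Never: does not automatically restart a terminated container.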
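
The three values can be read as a small decision function. The sketch below is only a model of the rule in the list above; shouldRestart is a hypothetical helper, not kubelet code:

```go
package main

import "fmt"

// shouldRestart models the three restartPolicy values listed above.
// An empty policy is treated as the default, Always.
func shouldRestart(policy string, exitCode int) bool {
	switch policy {
	case "", "Always":
		return true
	case "OnFailure":
		return exitCode != 0
	default: // "Never"
		return false
	}
}

func main() {
	for _, policy := range []string{"Always", "OnFailure", "Never"} {
		for _, code := range []int{0, 1} {
			fmt.Printf("policy=%-9s exit=%d restart=%v\n", policy, code, shouldRestart(policy, code))
		}
	}
}
```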

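restartPolicy applies to the app containers in the Pod and to regular init containers. Sidecar containers ignore the Pod-level restartPolicy field (the built-in sidecar containers that Kubernetes introduced recently are covered later). When containers in a Pod exit, the kubelet restarts them with an exponential backoff delay (10s, 20s, 40s, ...) that is capped at 5 minutes. Once a container has run for 10 minutes without any problems, the kubelet resets the restart backoff timer for that container.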
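
The backoff schedule can be sketched as follows. The constants come from the paragraph above; the function names and the way the reset is modelled are assumptions made for illustration, not the kubelet's implementation:

```go
package main

import (
	"fmt"
	"time"
)

const (
	initialDelay = 10 * time.Second // first backoff delay
	maxDelay     = 5 * time.Minute  // cap on the delay
	resetAfter   = 10 * time.Minute // healthy run time that resets the backoff
)

// restartDelay returns the delay before the next restart after the given
// number of consecutive backed-off restarts: 10s, 20s, 40s, ... capped at 5m.
func restartDelay(restarts int) time.Duration {
	d := initialDelay
	for i := 0; i < restarts; i++ {
		d *= 2
		if d >= maxDelay {
			return maxDelay
		}
	}
	return d
}

// nextRestartCount resets the counter when the container ran long enough
// without problems, otherwise it increments it.
func nextRestartCount(restarts int, ranFor time.Duration) int {
	if ranFor >= resetAfter {
		return 0
	}
	return restarts + 1
}

func main() {
	for n := 0; n <= 6; n++ {
		fmt.Printf("restart #%d waits %v\n", n+1, restartDelay(n))
	}
	fmt.Println(nextRestartCount(4, 12*time.Minute))
}
```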

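If your cluster has the ContainerRestartRules feature gate enabled, you can specify restartPolicy and restartPolicyRules on individual containers to override the Pod restart policy. Container restart policies and rules apply to the app containers in the Pod and to regular init containers.

Feature state: Kubernetes v1.34 [alpha] (disabled by default)

The following example shows a Pod with restartPolicy Always that contains an init container which runs only once. If the init container fails, the Pod fails. This lets the Pod fail when initialization fails, while keeping it running once initialization succeeds: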

apiVersion: v1
kind: Pod
metadata:
  name: fail-pod-if-init-fails
spec:
  restartPolicy: Always
  initContainers:
  - name: init-once      # This init container is attempted only once. If it fails, the Pod fails.
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'echo "Failing initialization" && sleep 10 && exit 1']
    restartPolicy: Never
  containers:
  - name: main-container # Once initialization succeeds, this container is always restarted.
    image: docker.io/library/busybox:1.28
    command: ['sh', '-c', 'sleep 1800 && exit 0']

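Container lifecycle hooks

Kubernetes provides two lifecycle hooks for containers, which let you run code at specific points in the container lifecycle:

postStart

This hook runs immediately after a container is created. However, there is no guarantee that it runs before the container's ENTRYPOINT. No parameters are passed to the handler.

The container creation code in Kubernetes shows where the hook is invoked: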

// https://github.com/kubernetes/kubernetes/blob/release-1.34/pkg/kubelet/kuberuntime/kuberuntime_container.go

func (m *kubeGenericRuntimeManager) startContainer(ctx context.Context, podSandboxID string, podSandboxConfig *runtimeapi.PodSandboxConfig, spec *startSpec, pod *v1.Pod, podStatus *kubecontainer.PodStatus, pullSecrets []v1.Secret, podIP string, podIPs []string) (string, error) {
	container := spec.container

	// Step 1: pull the image.
	imageRef, msg, err := m.imagePuller.EnsureImageExists(ctx, pod, container, pullSecrets, podSandboxConfig)
  
  // Step 2: create the container.
  containerID, err := m.runtimeService.CreateContainer(ctx, podSandboxID, containerConfig, podSandboxConfig)
  
  // Step 3: start the container.
	err = m.runtimeService.StartContainer(ctx, containerID)
  
  // Step 4: execute the post start hook.
  if container.Lifecycle != nil && container.Lifecycle.PostStart != nil {
    msg, handlerErr := m.runner.Run(ctx, kubeContainerID, pod, container, container.Lifecycle.PostStart)
		if handlerErr != nil {
    }
    ...
  }
  ...
}

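preStop

This hook is called immediately before a container is terminated due to an API request or a management event such as a liveness or startup probe failure, preemption, or resource contention. The hook must complete before the TERM signal that stops the container is sent. **The Pod's termination grace period (30s by default) starts counting before the PreStop hook runs, so regardless of the outcome of the handler, the container is eventually terminated within the Pod's termination grace period.** No parameters are passed to the handler.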
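
Because the grace period is already running while the preStop hook executes, a slow hook leaves less time for the application to react to TERM. A minimal sketch of that arithmetic, assuming a hypothetical helper remainingGraceSeconds and ignoring any small extension the kubelet may grant when the hook overruns:

```go
package main

import "fmt"

// remainingGraceSeconds returns how many seconds of the termination grace
// period are left for the container to handle TERM once the preStop hook
// has finished. The grace period starts before the hook runs.
func remainingGraceSeconds(gracePeriodSeconds, preStopSeconds int64) int64 {
	if preStopSeconds >= gracePeriodSeconds {
		return 0 // the hook used up the whole grace period
	}
	return gracePeriodSeconds - preStopSeconds
}

func main() {
	fmt.Println(remainingGraceSeconds(30, 10)) // hook takes 10s of the default 30s
	fmt.Println(remainingGraceSeconds(30, 45)) // hook is slower than the grace period
}
```

If the application needs a fixed amount of time for its own shutdown, this suggests sizing terminationGracePeriodSeconds as the hook duration plus the application's shutdown time.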

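If a PostStart or PreStop hook fails (for example, the handler returns a non-zero value), the kubelet **kills the container**. After a PostStart failure, the container is recovered according to its restart policy. After a PreStop failure, the container was about to receive the termination signal anyway, so it is simply stopped.

Kubernetes provides **two handler implementations** for container hooks:

Exec: runs a specific command (for example pre-stop.sh) inside the cgroups and namespaces of the container. Resources consumed by the command are counted against the container.
HTTP: executes an HTTP request against a specific endpoint on the container.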

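Container probes

The lifecycle hooks above let the application run custom checks or auxiliary tasks right after a container starts or right before it stops, such as preparing configuration files or setting up application state.

A container probe, by contrast, is a **periodic diagnostic** that the kubelet performs on a container to determine its state.

Kubernetes provides **three types of probes**:

livenessProbe

The liveness probe indicates whether the container is running. If it fails, the kubelet kills the container, and the container is then subject to its restart policy. If no liveness probe is configured, the default state is Success.

If the process in your container **is able to crash on its own whenever it encounters an issue or becomes unhealthy, you do not necessarily need a livenessProbe**, because the kubelet automatically takes the right action according to the Pod's restartPolicy.

If the process **cannot crash on its own when something goes wrong**, for example in a deadlock, configure a livenessProbe for health checking.

readinessProbe

The readiness probe indicates whether the container is ready to serve requests. If it fails, **the endpoints controller removes the Pod's IP address from the endpoints of all Services that match the Pod**. When a readinessProbe is configured, the state before the initial delay defaults to Failure; without one, the default state is Success.

If your service has a **strict dependency** on backend services, you can implement both a livenessProbe and a readinessProbe; they do not interfere with each other. The livenessProbe checks the health of the application itself, while the readinessProbe checks whether the required backend services are available, which avoids directing traffic to Pods that can only respond with errors.

startupProbe

The startup probe indicates whether the application in the container has started. If a startupProbe is configured, **all other probes are disabled until it succeeds**. If it fails, the kubelet kills the container, and the container is then subject to its restart policy. If no startup probe is configured, the default state is Success.

Startup probes are useful for **Pods whose containers take a long time to become ready**. Rather than setting a long liveness interval, you configure a separate option for probing the container while it starts up, which allows a startup time much longer than the liveness interval would allow.

If your container usually takes longer to start than initialDelaySeconds + failureThreshold × periodSeconds, specify a startup probe that checks the same endpoint as the liveness probe. The default for periodSeconds is 10 seconds. Set the startup probe's failureThreshold high enough to give the container time to start, **without changing the default values of the liveness probe**. This helps protect against deadlocks.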
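
The startup budget formula above is easy to evaluate for concrete settings. The following sketch uses a hypothetical probeTiming struct that mirrors three of the Probe fields; the 120 second startup time in main is an invented example value:

```go
package main

import "fmt"

// probeTiming mirrors the three Probe fields that appear in the formula
// (illustrative type, not the real v1.Probe).
type probeTiming struct {
	InitialDelaySeconds int32
	PeriodSeconds       int32
	FailureThreshold    int32
}

// startupBudgetSeconds evaluates
// initialDelaySeconds + failureThreshold × periodSeconds.
func startupBudgetSeconds(p probeTiming) int32 {
	return p.InitialDelaySeconds + p.FailureThreshold*p.PeriodSeconds
}

// needsStartupProbe reports whether the typical startup time exceeds the
// budget that the liveness probe settings allow.
func needsStartupProbe(liveness probeTiming, typicalStartupSeconds int32) bool {
	return typicalStartupSeconds > startupBudgetSeconds(liveness)
}

func main() {
	// Liveness probe left at its defaults: period 10s, failureThreshold 3.
	liveness := probeTiming{InitialDelaySeconds: 0, PeriodSeconds: 10, FailureThreshold: 3}
	fmt.Println(startupBudgetSeconds(liveness), needsStartupProbe(liveness, 120))

	// A startup probe with failureThreshold 30 tolerates a much longer startup.
	startup := probeTiming{InitialDelaySeconds: 0, PeriodSeconds: 10, FailureThreshold: 30}
	fmt.Println(startupBudgetSeconds(startup))
}
```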

The Probe data structure is defined as follows. For the implementation, see the prober source code:

// ProbeHandler defines a specific action that should be taken in a probe.
// One and only one of the fields must be specified.
type ProbeHandler struct {
	Exec *ExecAction 
	HTTPGet *HTTPGetAction
	TCPSocket *TCPSocketAction
	GRPC *GRPCAction
}

type Probe struct {
	// The action taken to determine the health of a container
	ProbeHandler `json:",inline" protobuf:"bytes,1,opt,name=handler"`

	// Number of seconds to wait after the container has started before
	// readiness, liveness, and startup probes are initiated.
	// +optional
	InitialDelaySeconds int32

	// Number of seconds after which the probe times out.
	// Defaults to 1 second. Minimum value is 1.
	// +optional
	TimeoutSeconds int32

	// How often (in seconds) to perform the probe.
	// Default to 10 seconds. Minimum value is 1.
	// +optional
	PeriodSeconds int32

	// Minimum consecutive successes for the probe to be considered successful
	// after having failed.
	// Defaults to 1. Must be 1 for liveness and startup. Minimum value is 1.
	// +optional
	SuccessThreshold int32

	// Minimum consecutive failures for the probe to be considered failed.
	// Defaults to 3. Minimum value is 1.
	// +optional
	FailureThreshold int32

	// Grace period the Pod is given to terminate gracefully after a probe failure.
	// A value of 0 kills the container immediately.
	// Minimum value is 1. spec.terminationGracePeriodSeconds is used if unset.
	// +optional
	TerminationGracePeriodSeconds *int64
}

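When a probe **fails failureThreshold times in a row**, note the following:

For a failed startup or liveness probe, Kubernetes treats the container as unhealthy and **triggers a restart of that specific container**. The kubelet honors the container's terminationGracePeriodSeconds setting.
For a failed readiness probe, the kubelet keeps the failing container running and continues to run probes. Because the check failed, the kubelet sets the Pod's Ready condition to false. As noted above, **the endpoints controller then removes the Pod's IP address from the endpoints of all Services that match the Pod**.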
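
How successThreshold and failureThreshold turn a stream of probe results into a state can be sketched as a small counter. This is an illustrative model only (probeTracker and its methods are hypothetical names), assuming the container starts out healthy:

```go
package main

import "fmt"

// probeTracker counts consecutive probe results and flips the state only
// when the corresponding threshold is reached.
type probeTracker struct {
	successThreshold int
	failureThreshold int
	healthy          bool
	lastOK           bool
	streak           int // length of the current run of identical results
}

func newProbeTracker(successThreshold, failureThreshold int) *probeTracker {
	return &probeTracker{
		successThreshold: successThreshold,
		failureThreshold: failureThreshold,
		healthy:          true,
		lastOK:           true,
	}
}

// observe records one probe result and returns the resulting state.
func (p *probeTracker) observe(ok bool) bool {
	if ok == p.lastOK {
		p.streak++
	} else {
		p.lastOK = ok
		p.streak = 1
	}
	if ok && p.streak >= p.successThreshold {
		p.healthy = true
	}
	if !ok && p.streak >= p.failureThreshold {
		p.healthy = false
	}
	return p.healthy
}

func main() {
	tracker := newProbeTracker(1, 3) // defaults: successThreshold 1, failureThreshold 3
	for _, result := range []bool{false, false, false, true} {
		fmt.Printf("probe ok=%v -> healthy=%v\n", result, tracker.observe(result))
	}
}
```

For a liveness or startup probe, the transition to unhealthy is the point at which the container would be restarted; for a readiness probe, it is the point at which the Pod stops receiving traffic.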

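Probes support four check mechanisms:

exec: executes a specified command inside the container. The diagnostic succeeds if the command exits with status code 0.
grpc: performs a remote procedure call using gRPC. The target should implement gRPC health checks. The diagnostic succeeds if the status of the response is SERVING. gRPC probes started out as an alpha feature behind the GRPCContainerProbe feature gate and have been stable since Kubernetes v1.27.
httpGet: performs an HTTP GET request against the container's IP address on a specified port and path. The diagnostic succeeds if the response status code is greater than or equal to 200 and less than 400.
tcpSocket: performs a TCP check against the container's IP address on a specified port. The diagnostic succeeds if the port is open. If the remote system (the container) closes the connection immediately after it opens, this still counts as healthy.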
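
The httpGet success rule can be tried out with the standard library alone. The sketch below is not the kubelet's prober; httpStatusHealthy and httpGetProbe are hypothetical helpers, the /healthz path is an invented example, and redirects are deliberately not followed so that the returned status code is evaluated directly:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// httpStatusHealthy applies the httpGet rule: 200 <= code < 400.
func httpStatusHealthy(code int) bool {
	return code >= 200 && code < 400
}

// httpGetProbe performs a single GET with a timeout. Any transport error,
// including a timeout, counts as a failed probe.
func httpGetProbe(url string, timeout time.Duration) bool {
	client := &http.Client{
		Timeout: timeout,
		// Simplification: return 3xx responses as-is instead of following them.
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			return http.ErrUseLastResponse
		},
	}
	resp, err := client.Get(url)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return httpStatusHealthy(resp.StatusCode)
}

func main() {
	// A throwaway local server stands in for the container.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path == "/healthz" {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	}))
	defer srv.Close()

	fmt.Println(httpGetProbe(srv.URL+"/healthz", time.Second))
	fmt.Println(httpGetProbe(srv.URL+"/other", time.Second))
}
```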

Init & Sidecar Container

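Init containers are specialized containers in a Pod that run before the app containers start. They can contain utilities or setup scripts that are not present in the app image.

You specify init containers in the Pod spec through the initContainers field, alongside the containers array that describes the app containers: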

// https://github.com/kubernetes/kubernetes/blob/release-1.34/staging/src/k8s.io/api/core/v1/types.go

// PodSpec is a description of a pod.
type PodSpec struct {
	...
	// List of initialization containers belonging to the pod.
	// Init containers are executed in order prior to containers being started. 
	InitContainers []Container `json:"initContainers,omitempty" patchStrategy:"merge" patchMergeKey:"name" protobuf:"bytes,20,rep,name=initContainers"`
	// List of containers belonging to the pod.
	// Containers cannot currently be added or removed.
	// There must be at least one container in a Pod.
	Containers []Container `json:"containers" patchStrategy:"merge" patchMergeKey:"name" protobuf:"bytes,2,rep,name=containers"`
	// List of ephemeral containers run in this pod. Ephemeral containers may be run in an existing
	// pod to perform user-initiated actions such as debugging. This list cannot be specified when
	// creating a pod, and it cannot be modified by updating the pod spec. In order to add an
	// ephemeral container to an existing pod, use the pod's ephemeralcontainers subresource.
	EphemeralContainers []EphemeralContainer `json:"ephemeralContainers,omitempty" patchStrategy:"merge" patchMergeKey:"name" protobuf:"bytes,34,rep,name=ephemeralContainers"`
  ...
}

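A Pod can have multiple app containers, and it can also have one or more init containers. The init containers run one after another before any app container starts.

Init containers are very similar to regular containers, with two differences:

Init containers always run to completion (they exit when done and cannot be long-running).
Each init container must complete successfully before the next one starts.

If an init container fails, the kubelet restarts it repeatedly until it succeeds. However, if the Pod's restartPolicy is Never and an init container fails during Pod startup, Kubernetes treats the whole Pod as failed.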
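
The ordering and failure rules above can be modelled with a short simulation. The types and names below (initContainer, runInitContainers) are hypothetical, and the retry loop is collapsed into a counter rather than simulated in real time:

```go
package main

import "fmt"

// initContainer describes a simulated init container.
type initContainer struct {
	name     string
	failures int // failed attempts before the first successful run
}

// runInitContainers runs the init containers strictly in order. It returns
// the names of the containers that were started and whether the Pod failed.
func runInitContainers(containers []initContainer, restartPolicy string) (started []string, podFailed bool) {
	for _, c := range containers {
		started = append(started, c.name)
		if c.failures > 0 && restartPolicy == "Never" {
			// With restartPolicy Never a failed init container fails the Pod,
			// so later init containers never start.
			return started, true
		}
		// Otherwise the kubelet keeps restarting the container until it
		// succeeds, and only then moves on to the next one.
	}
	return started, false
}

func main() {
	containers := []initContainer{
		{name: "init-myservice", failures: 2},
		{name: "init-mydb", failures: 0},
	}
	fmt.Println(runInitContainers(containers, "Always"))
	fmt.Println(runInitContainers(containers, "Never"))
}
```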

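Init containers are commonly used for prerequisite tasks such as waiting for a service to become ready, downloading configuration files, initializing a database, or setting permissions. Because they run before the app containers and must complete successfully, they are a good fit for building robust startup dependencies.

The following example defines a simple Pod with two init containers. The first waits for myservice and the second waits for mydb. Once both init containers complete, the Pod runs the app container from its spec section.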

apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
  labels:
    app.kubernetes.io/name: MyApp
spec:
  containers:
  - name: myapp-container
    image: busybox:1.28
    command: ['sh', '-c', 'echo The app is running! && sleep 3600']
  initContainers:
  - name: init-myservice
    image: busybox:1.28
    command: ['sh', '-c', "until nslookup myservice.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done"]
  - name: init-mydb
    image: busybox:1.28
    command: ['sh', '-c', "until nslookup mydb.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for mydb; sleep 2; done"]

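Recent Kubernetes versions introduced sidecar containers: containers that start before the main application container and keep running.

Feature state: Kubernetes v1.33 [stable] (enabled by default)

Sidecar containers are implemented as a special case of init containers. They differ from init containers mainly in the following ways:

Sidecar containers remain running after Pod startup. If you set restartPolicy to Always when you define an init container, it starts and keeps running for the entire lifetime of the Pod. This is helpful for running supporting services that are separate from the main application container.
Init containers do not support lifecycle, livenessProbe, readinessProbe, or startupProbe, whereas sidecar containers support all of these probes to control their lifecycle.

If a readinessProbe is specified for such an init container, its result is used to determine the Pod's ready state. Because these containers are defined as init containers, they get the same ordering and sequential guarantees as other init containers, so sidecar containers can be mixed with regular init containers to build complex Pod initialization flows.

The following is an example of a Deployment with two containers, one of which is a sidecar: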

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: alpine:latest
          command: ['sh', '-c', 'while true; do echo "logging" >> /opt/logs.txt; sleep 1; done']
          volumeMounts:
            - name: data
              mountPath: /opt
      initContainers:
        - name: logshipper
          image: alpine:latest
          restartPolicy: Always
          command: ['sh', '-c', 'tail -F /opt/logs.txt']
          volumeMounts:
            - name: data
              mountPath: /opt
      volumes:
        - name: data
          emptyDir: {}

Ephemeral Containers

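Pods are the fundamental building block of Kubernetes applications. Because Pods are meant to be disposable and replaceable, you cannot add a container to a Pod once it has been created. Instead, you usually delete and replace Pods in a controlled way using Deployments. Sometimes, however, you need to inspect the state of an existing Pod, for example to troubleshoot a hard-to-reproduce bug. In these cases you can run an ephemeral container in the existing Pod to inspect its state and run arbitrary commands.

Ephemeral containers are created using a special ephemeralcontainers handler in the API rather than by adding them directly to pod.spec, so you cannot add an ephemeral container with kubectl edit.

Like regular containers, an ephemeral container cannot be changed or removed after it has been added to a Pod.

Uses for ephemeral containers: they are useful for interactive troubleshooting when kubectl exec is insufficient because a container has crashed or the container image does not include debugging utilities.

In particular, distroless images let you deploy minimal container images that reduce the attack surface and the exposure to bugs and vulnerabilities. Because distroless images do not include a shell or any debugging utilities, it is difficult to troubleshoot them using kubectl exec alone.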

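Workload Management

As introduced at the beginning of this article, a workload, that is, your application, runs as containers inside Pods. Managing individual Pods directly is tedious, though. For example, if a Pod fails, you probably want a new Pod to be started automatically to replace it. Kubernetes can do this for you.

You use the Kubernetes API to create a workload object that sits at a higher level of abstraction than a Pod (such as a Deployment or a StatefulSet), and the Kubernetes control plane then manages the underlying Pod objects for you based on the specification of that workload object.

This is why the introduction to Pods at the start of this article already made an important point: do not use naked Pods (Pods not bound to a ReplicaSet or Deployment), because they are not rescheduled if their node fails.

Let's now look at the workload objects: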

Deployment

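A Deployment is a high-level Kubernetes resource for declaratively managing application deployments. It manages a set of Pods that run an application workload, usually one that is stateless.

A Deployment provides declarative updates for Pods and ReplicaSets. You describe the desired state in the Deployment's spec, and the Deployment controller changes the actual state reported in status towards it at a controlled rate. The Deployment orchestrates Pod creation, deletion, and updates by creating and managing ReplicaSets.

Typical use cases for Deployments:

Create a Deployment to roll out a ReplicaSet. The ReplicaSet creates Pods in the background.
Rolling upgrade: update the Deployment's PodTemplateSpec to declare the new desired state of the Pods. A new ReplicaSet is created, and the Deployment gradually scales up the new ReplicaSet while scaling down the old one, replacing Pods at a controlled rate. Each new ReplicaSet updates the revision of the Deployment.
Roll back to an earlier Deployment revision.
Scale the number of Pods up or down.
Pause the rollout of a Deployment to apply multiple changes to its PodTemplateSpec, then resume it to start a new rollout.
Use the status of the Deployment as an indicator that a rollout is stuck.
Clean up old ReplicaSets that are no longer needed.

The DeploymentSpec structure is defined as follows: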

// https://github.com/kubernetes/kubernetes/blob/release-1.34/staging/src/k8s.io/api/apps/v1/types.go

// DeploymentSpec is the specification of the desired behavior of the Deployment.
type DeploymentSpec struct {
	// Number of desired pods. This is a pointer to distinguish between explicit
	// zero and not specified. Defaults to 1.
	Replicas *int32 `json:"replicas,omitempty" protobuf:"varint,1,opt,name=replicas"`

	// Label selector for pods. Existing ReplicaSets whose pods are
	// selected by this will be the ones affected by this deployment.
	// It must match the pod template's labels.
	Selector *metav1.LabelSelector `json:"selector" protobuf:"bytes,2,opt,name=selector"`

	// Template describes the pods that will be created.
	// The only allowed template.spec.restartPolicy value is "Always".
	Template v1.PodTemplateSpec `json:"template" protobuf:"bytes,3,opt,name=template"`

	// The deployment strategy to use to replace existing pods with new ones.
	Strategy DeploymentStrategy `json:"strategy,omitempty" patchStrategy:"retainKeys" protobuf:"bytes,4,opt,name=strategy"`

	MinReadySeconds int32 `json:"minReadySeconds,omitempty" protobuf:"varint,5,opt,name=minReadySeconds"`

	// The number of old ReplicaSets to retain to allow rollback.
	RevisionHistoryLimit *int32 `json:"revisionHistoryLimit,omitempty" protobuf:"varint,6,opt,name=revisionHistoryLimit"`

	// Indicates that the deployment is paused.
	Paused bool `json:"paused,omitempty" protobuf:"varint,7,opt,name=paused"`

	ProgressDeadlineSeconds *int32 `json:"progressDeadlineSeconds,omitempty" protobuf:"varint,9,opt,name=progressDeadlineSeconds"`
}

The following example creates a Deployment workload. It creates a ReplicaSet that brings up three nginx Pods:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80

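DeploymentSpec in detail

Replicas

.spec.replicas is an optional field that specifies the number of desired Pods. It defaults to 1.

If you manually scale a Deployment (for example with kubectl scale deployment deployment --replicas=X) and then update that Deployment from a manifest (for example by running kubectl apply -f deployment.yaml), applying the manifest overwrites the manual scaling you did earlier.

If a HorizontalPodAutoscaler (HPA) or a similar horizontal scaling API is managing the scaling of the Deployment, do not set .spec.replicas.

Instead, let the Kubernetes control plane manage the .spec.replicas field automatically.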

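PodTemplateSpec (Pod template)

.spec.template is a required field of the Deployment's .spec.

.spec.template is a Pod template. It has exactly the same schema as a Pod, except that it is nested inside the Deployment and therefore has no apiVersion or kind. Pod templates were also covered in the Pods section above.

In addition to the required fields of a Pod, the Pod template in a Deployment must specify appropriate labels and an appropriate restart policy:

Labels: make sure the labels do not overlap with the label selectors of other controllers (other Deployments, StatefulSets, and so on). See the selector section for details.
Restart policy: only Always is allowed, which is also the default when the field is not specified.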

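Selector

.spec.selector is a required field that specifies the label selector for the Pods targeted by this Deployment.

.spec.selector must match .spec.template.metadata.labels, or the Deployment is rejected by the Kubernetes API.

In API version apps/v1, .spec.selector and .metadata.labels do not default to .spec.template.metadata.labels when they are not set, so they must be set explicitly. Also note that in apps/v1 .spec.selector is immutable after the Deployment has been created.

A Deployment manages Pods according to the following rules:

If Pods whose labels match the selector have a template that differs from the current .spec.template, or if the total number of matching Pods exceeds .spec.replicas, the Deployment terminates the surplus or non-conforming Pods.
If the number of Pods matching the selector is lower than the number desired in .spec.replicas, the Deployment creates new Pods from .spec.template.

You should not create other Pods whose labels match this selector, either directly or through another Deployment, ReplicaSet, or ReplicationController.
If you do, the Deployment assumes that it created and manages those Pods itself.

Kubernetes does not stop you from creating multiple controllers with overlapping label selectors, but the controllers will then fight with each other, repeatedly creating or deleting each other's Pods, and the system will behave unpredictably.

It is therefore good practice to give every controller, such as a Deployment, a unique and non-overlapping label selector.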
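
Why an unrelated Pod can be picked up by a Deployment becomes clear when the matchLabels rule is written out. The sketch below covers only equality-based matchLabels (not matchExpressions), and selectorMatches is a hypothetical helper:

```go
package main

import "fmt"

// selectorMatches reports whether every key/value pair in matchLabels is
// present in the Pod's labels. Extra labels on the Pod do not matter.
func selectorMatches(matchLabels, podLabels map[string]string) bool {
	for key, want := range matchLabels {
		if got, ok := podLabels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	selector := map[string]string{"app": "nginx"}

	managed := map[string]string{"app": "nginx", "pod-template-hash": "7c58678d5b"}
	handMade := map[string]string{"app": "nginx", "owner": "alice"} // created manually
	other := map[string]string{"app": "redis"}

	fmt.Println(selectorMatches(selector, managed))
	fmt.Println(selectorMatches(selector, handMade)) // also matches: the overlap problem
	fmt.Println(selectorMatches(selector, other))
}
```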

Strategy

.spec.strategy specifies the strategy used to replace old Pods with new ones.
.spec.strategy.type can be "Recreate" or "RollingUpdate". "RollingUpdate" is the default.

The Strategy types are defined as follows:

// https://github.com/kubernetes/kubernetes/blob/release-1.34/staging/src/k8s.io/api/apps/v1/types.go

// DeploymentStrategy describes how to replace existing pods with new ones.
type DeploymentStrategy struct {
	// Type of deployment. Can be "Recreate" or "RollingUpdate". Default is RollingUpdate.
	Type DeploymentStrategyType `json:"type,omitempty" protobuf:"bytes,1,opt,name=type,casttype=DeploymentStrategyType"`

	// Rolling update config params. Present only if DeploymentStrategyType =
	// RollingUpdate.
	RollingUpdate *RollingUpdateDeployment `json:"rollingUpdate,omitempty" protobuf:"bytes,2,opt,name=rollingUpdate"`
}

// +enum
type DeploymentStrategyType string

const (
	// Kill all existing pods before creating new ones.
	RecreateDeploymentStrategyType DeploymentStrategyType = "Recreate"

	// Replace the old ReplicaSets by new one using rolling update i.e gradually scale down the old ReplicaSets and scale up the new one.
	RollingUpdateDeploymentStrategyType DeploymentStrategyType = "RollingUpdate"
)

// Spec to control the desired behavior of rolling update.
type RollingUpdateDeployment struct {
	MaxUnavailable *intstr.IntOrString `json:"maxUnavailable,omitempty" protobuf:"bytes,1,opt,name=maxUnavailable"`
	MaxSurge *intstr.IntOrString `json:"maxSurge,omitempty" protobuf:"bytes,2,opt,name=maxSurge"`
}

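Recreate Deployment

When .spec.strategy.type == "Recreate", all existing Pods are killed before new ones are created.

This only guarantees that old Pods are terminated before new Pods are created during an upgrade.

If you upgrade a Deployment, all Pods of the old revision are terminated immediately, and the controller waits until they have been removed successfully before creating any Pod of the new revision. However, if you manually delete a Pod, its lifecycle is controlled by the ReplicaSet and a replacement is created immediately, even if the old Pod is still in the Terminating state.

If you need an "at most N" guarantee for the number of Pods, consider using a StatefulSet.

Rolling Update Deployment

When .spec.strategy.type == "RollingUpdate", the Deployment updates Pods in a rolling fashion, gradually scaling down the old ReplicaSets while scaling up the new one.

You can control the rolling update process with maxUnavailable and maxSurge.

Max Unavailable

.spec.strategy.rollingUpdate.maxUnavailable is an optional field that specifies the maximum number of Pods that can be unavailable during the update. The value can be an absolute number (for example, 5) or a percentage of the desired Pods (for example, 10%). Percentages are rounded down.

The value cannot be 0 if .spec.strategy.rollingUpdate.maxSurge is 0. The default is 25%.

For example, when this value is set to 30%, the old ReplicaSet can be scaled down to 70% of the desired Pods immediately when the rolling update starts. Once new Pods are ready, the old ReplicaSet can be scaled down further, followed by scaling up the new ReplicaSet, which ensures that **the number of available Pods is at least 70% of the desired Pods at all times** during the update.

Max Surge

.spec.strategy.rollingUpdate.maxSurge is an optional field that specifies the maximum number of Pods that can be created over the desired number of Pods.

The value can be an absolute number (for example, 5) or a percentage of the desired Pods (for example, 10%). Percentages are rounded up. The value cannot be 0 if maxUnavailable is 0. The default is 25%.

For example, when this value is set to 30%, the new ReplicaSet can be scaled up immediately when the rolling update starts, such that **the total number of old and new Pods does not exceed 130% of the desired Pods. Once old Pods have been killed, the new ReplicaSet can be scaled up further, which ensures that the total number of Pods running at any time during the update is at most 130% of the desired Pods**.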
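
The rounding rules are the easiest part to get wrong, so here is a small sketch that turns the two percentages into concrete bounds. It only models the arithmetic described above (maxUnavailable rounds down, maxSurge rounds up); rolloutBounds and its helpers are hypothetical names, not the controller's functions:

```go
package main

import "fmt"

// percentFloor returns percent% of replicas, rounded down.
func percentFloor(percent, replicas int) int {
	return percent * replicas / 100
}

// percentCeil returns percent% of replicas, rounded up.
func percentCeil(percent, replicas int) int {
	return (percent*replicas + 99) / 100
}

// rolloutBounds returns the minimum number of available Pods and the
// maximum total number of Pods during a rolling update.
func rolloutBounds(replicas, maxUnavailablePct, maxSurgePct int) (minAvailable, maxTotal int) {
	unavailable := percentFloor(maxUnavailablePct, replicas) // rounded down
	surge := percentCeil(maxSurgePct, replicas)              // rounded up
	return replicas - unavailable, replicas + surge
}

func main() {
	for _, replicas := range []int{3, 4, 10} {
		minAvailable, maxTotal := rolloutBounds(replicas, 25, 25)
		fmt.Printf("replicas=%d (25%%/25%%): at least %d available, at most %d total\n",
			replicas, minAvailable, maxTotal)
	}
	minAvailable, maxTotal := rolloutBounds(10, 30, 30)
	fmt.Printf("replicas=10 (30%%/30%%): at least %d available, at most %d total\n", minAvailable, maxTotal)
}
```

Note how the defaults behave for small replica counts: with 3 replicas, 25% unavailable rounds down to 0 while 25% surge rounds up to 1.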

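With the default "RollingUpdate" strategy, maxSurge and maxUnavailable both default to 25% when they are not set. When both have a value, how does the Deployment roll out the update: does it terminate some old Pods first, or create some new Pods first? The Deployment controller code below shows that, within the limits set by the strategy, it first tries to scale up the new ReplicaSet and only then tries to scale down the old ones.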

// https://github.com/kubernetes/kubernetes/blob/release-1.34/pkg/controller/deployment/rolling.go

// rolloutRolling implements the logic for rolling a new replica set.
func (dc *DeploymentController) rolloutRolling(ctx context.Context, d *apps.Deployment, rsList []*apps.ReplicaSet) error {
	newRS, oldRSs, err := dc.getAllReplicaSetsAndSyncRevision(ctx, d, rsList, true)
	if err != nil {
		return err
	}
	allRSs := append(oldRSs, newRS)

	// Scale up, if we can.
	scaledUp, err := dc.reconcileNewReplicaSet(ctx, allRSs, newRS, d)
	if err != nil {
		return err
	}
	if scaledUp {
		// Update DeploymentStatus
		return dc.syncRolloutStatus(ctx, allRSs, newRS, d)
	}

	// Scale down, if we can.
	scaledDown, err := dc.reconcileOldReplicaSets(ctx, allRSs, controller.FilterActiveReplicaSets(oldRSs), newRS, d)
	if err != nil {
		return err
	}
	if scaledDown {
		// Update DeploymentStatus
		return dc.syncRolloutStatus(ctx, allRSs, newRS, d)
	}

	if deploymentutil.DeploymentComplete(d, &d.Status) {
		if err := dc.cleanupDeployment(ctx, oldRSs, d); err != nil {
			return err
		}
	}

	// Sync deployment status
	return dc.syncRolloutStatus(ctx, allRSs, newRS, d)
}

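The following is an example of a rolling update Deployment that sets maxUnavailable and maxSurge. During the rollout, roughly one new Pod is created and one old Pod is terminated at a time until all Pods have been replaced: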

apiVersion: apps/v1
kind: Deployment
metadata:
 name: nginx-deployment
 labels:
   app: nginx
spec:
 replicas: 3
 selector:
   matchLabels:
     app: nginx
 template:
   metadata:
     labels:
       app: nginx
   spec:
     containers:
     - name: nginx
       image: nginx:1.14.2
       ports:
       - containerPort: 80
 strategy:
   type: RollingUpdate
   rollingUpdate:
     maxSurge: 1
     maxUnavailable: 1

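Progress Deadline Seconds

.spec.progressDeadlineSeconds is an optional field that specifies how long, in seconds, you are willing to wait for the Deployment to make progress before the system reports that it has failed to progress. Once that time passes without progress, the Deployment status reports the following condition:

type: Progressing
status: "False"
reason: ProgressDeadlineExceeded

The Deployment controller keeps retrying the Deployment. The field defaults to 600 seconds (10 minutes).

In the future, once automatic rollback is implemented, the Deployment controller will roll back a Deployment as soon as it observes this condition.

Note: if the field is set explicitly, its value must be greater than .spec.minReadySeconds.

Min Ready Seconds

.spec.minReadySeconds is an optional field that specifies the minimum time for which a newly created Pod must be Ready, without any of its containers crashing, before it is considered available.

The default is 0, which means a Pod is considered available as soon as it becomes ready. To learn when a Pod is considered ready, see Container Probes.

Paused

.spec.paused is an optional boolean field for pausing and resuming a Deployment. The only difference between a paused Deployment and one that is not paused is that changes to the .spec.template (the Pod template) of a paused Deployment do not trigger new rollouts while it is paused.

A Deployment is not paused by default when it is created.

Setting .spec.paused: true temporarily freezes the Deployment's automatic update behavior. In this state you can safely modify the Pod template (for example the image or environment variables) without triggering a rollout right away. When you later unpause it (set the field to false), the Deployment applies all accumulated template changes at once and starts a single rollout.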

Pod-template-hash

The ReplicaSet and Pod resources created through a Deployment all carry a pod-template-hash label, as shown below:

$ kubectl get all --show-labels
NAME                                       READY   STATUS    RESTARTS   AGE    LABELS
pod/wecom-read-it-later-7c58678d5b-r6n9w   1/1     Running   0          44d    app.kubernetes.io/instance=wecom-read-it-later,app.kubernetes.io/name=wecom-read-it-later,pod-template-hash=7c58678d5b


NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE    LABELS
deployment.apps/wecom-read-it-later   1/1     1            1           51d    app.kubernetes.io/instance=wecom-read-it-later,app.kubernetes.io/managed-by=Helm,app.kubernetes.io/name=wecom-read-it-later,app.kubernetes.io/version=0.0.0-master.330bf3e6,helm.sh/chart=wecom-read-it-later-0.0.0-master.330bf3e6

NAME                                             DESIRED   CURRENT   READY   AGE    LABELS
replicaset.apps/wecom-read-it-later-7c58678d5b   1         1         1       44d    app.kubernetes.io/instance=wecom-read-it-later,app.kubernetes.io/name=wecom-read-it-later,pod-template-hash=7c58678d5b

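pod-template-hash is a label that the Deployment controller adds automatically. Its main purpose is to ensure that the Pods managed by the different ReplicaSets of one Deployment do not overlap, which avoids conflicts between controllers. It works and is generated as follows:

When you create a Deployment, the Deployment controller computes a hash from the Pod template (.spec.template). That hash is the value of pod-template-hash.
The hash is added to:
- the selector of the ReplicaSet;
- the Pod template labels of the ReplicaSet;
- the labels of the Pods that are actually created.
The name of the ReplicaSet usually also ends with this hash, for example my-deployment-75675f5897.

Why is it needed?

During a rolling update, a Deployment creates multiple ReplicaSets (for the old and the new version).
If only user-defined labels (such as app=nginx) were used as the selector, different ReplicaSets could match the same Pods, the controllers would compete for ownership of those Pods, and the result would be chaos.
With a unique pod-template-hash added, each ReplicaSet only manages the Pods that carry its own hash, so ReplicaSets of different versions do not interfere with each other.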
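
The idea of deriving a label from the template contents can be illustrated as follows. This is a deliberately simplified sketch: the podTemplate struct, the JSON encoding, and the hex formatting are assumptions made for illustration, and the real controller hashes the full PodTemplateSpec and encodes the result differently, so the values will not match real pod-template-hash labels:

```go
package main

import (
	"encoding/json"
	"fmt"
	"hash/fnv"
)

// podTemplate is a tiny stand-in for a PodTemplateSpec (illustrative only).
type podTemplate struct {
	Labels map[string]string `json:"labels"`
	Image  string            `json:"image"`
}

// templateHash derives a short, deterministic hash from the template.
// encoding/json sorts map keys, so equal templates encode identically.
func templateHash(t podTemplate) string {
	data, err := json.Marshal(t)
	if err != nil {
		panic(err) // cannot happen for this simple struct
	}
	h := fnv.New32a()
	h.Write(data)
	return fmt.Sprintf("%08x", h.Sum32())
}

// replicaSetName appends the hash to the Deployment name.
func replicaSetName(deployment string, t podTemplate) string {
	return deployment + "-" + templateHash(t)
}

func main() {
	oldTemplate := podTemplate{Labels: map[string]string{"app": "nginx"}, Image: "nginx:1.14.2"}
	newTemplate := podTemplate{Labels: map[string]string{"app": "nginx"}, Image: "nginx:1.16.1"}

	// Changing the image changes the template, hence the hash and the
	// name of the ReplicaSet that owns the resulting Pods.
	fmt.Println(replicaSetName("nginx-deployment", oldTemplate))
	fmt.Println(replicaSetName("nginx-deployment", newTemplate))
}
```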

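Updating a Deployment

To state the key point up front:

A rollout is triggered only when the Deployment's Pod template (.spec.template) changes, for example when the labels or the container image in the template are updated. Other operations, such as scaling, do not trigger a rollout.

You can update a Deployment with the following steps. Suppose we want to update the nginx Pods from the nginx:1.14.2 image to nginx:1.16.1.

kubectl set image deployment.apps/nginx-deployment nginx=nginx:1.16.1

The output is similar to:

deployment.apps/nginx-deployment image updated

Alternatively, you can edit the Deployment directly:

kubectl edit deployment/nginx-deployment

Change .spec.template.spec.containers[0].image from nginx:1.14.2 to nginx:1.16.1.

The output is similar to:

deployment.apps/nginx-deployment edited

Check the rollout status:

kubectl rollout status deployment/nginx-deployment

Possible output: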

Waiting for deployment "nginx-deployment" rollout to finish: 1 out of 3 new replicas have been updated...
Waiting for deployment "nginx-deployment" rollout to finish: 1 out of 3 new replicas have been updated...
Waiting for deployment "nginx-deployment" rollout to finish: 1 out of 3 new replicas have been updated...
Waiting for deployment "nginx-deployment" rollout to finish: 2 out of 3 new replicas have been updated...
Waiting for deployment "nginx-deployment" rollout to finish: 2 out of 3 new replicas have been updated...
Waiting for deployment "nginx-deployment" rollout to finish: 2 out of 3 new replicas have been updated...
Waiting for deployment "nginx-deployment" rollout to finish: 1 old replicas are pending termination...
Waiting for deployment "nginx-deployment" rollout to finish: 1 old replicas are pending termination...
deployment "nginx-deployment" successfully rolled out

After the rollout succeeds, run:

kubectl get deployments

The output is similar to:

NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   3/3     3            3           36s

Run the following command to see how the ReplicaSets changed:

kubectl get rs

The output is similar to:

NAME                          DESIRED   CURRENT   READY   AGE
nginx-deployment-1564180365   3         3         3       6s   # new ReplicaSet
nginx-deployment-2035384211   0         0         0       36s  # old ReplicaSet (scaled down)

Then check the Pods:

kubectl get pods

The output contains only Pods of the new version:

NAME                                READY   STATUS    RESTARTS   AGE
nginx-deployment-1564180365-khku8   1/1     Running   0          14s
nginx-deployment-1564180365-nacti   1/1     Running   0          14s
nginx-deployment-1564180365-z9gth   1/1     Running   0          14s

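The guarantees of the rolling update strategy were covered in detail in the Strategy part of DeploymentSpec above. As a brief recap, during an update a Deployment ensures that:

Not too many Pods are unavailable at once: by default at least 75% of the desired number of Pods stay available (at most 25% unavailable);
The total number of Pods does not go far above the desired number: by default at most 125% of the desired number of Pods exist (at most 25% surge).

For example, with the 3-replica Deployment above and the default 25% settings:

The update first creates 1 new Pod;
Once it is ready, 1 old Pod is deleted;
This alternates until the update finishes, so at least 3 Pods are always available and at most 4 Pods exist at the same time (25% of 3 rounds up to a surge of 1 and rounds down to 0 unavailable).

With 4 replicas, the number of Pods stays between 3 and 5 during the update.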
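
The alternation described above can be reproduced with a toy simulation that follows the "scale up if we can, otherwise scale down" order of rolloutRolling. It is a sketch under strong assumptions: maxSurge and maxUnavailable are already resolved to absolute numbers, new Pods become ready instantly, and deleted Pods vanish instantly. All names are hypothetical:

```go
package main

import "fmt"

// step records the number of old and new Pods after one reconcile pass.
type step struct{ Old, New int }

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// simulateRollout replays a rolling update and returns every intermediate state.
func simulateRollout(replicas, maxSurge, maxUnavailable int) []step {
	oldPods, newPods := replicas, 0
	steps := []step{{oldPods, newPods}}
	for oldPods > 0 || newPods < replicas {
		// Scale up, if we can: the total may not exceed replicas + maxSurge.
		up := minInt(replicas+maxSurge-(oldPods+newPods), replicas-newPods)
		if up > 0 {
			newPods += up
			steps = append(steps, step{oldPods, newPods})
			continue
		}
		// Scale down, if we can: keep at least replicas - maxUnavailable Pods.
		down := minInt(oldPods+newPods-(replicas-maxUnavailable), oldPods)
		if down > 0 {
			oldPods -= down
			steps = append(steps, step{oldPods, newPods})
			continue
		}
		break // neither direction is allowed, for example when both limits are 0
	}
	return steps
}

// totalRange returns the smallest and largest total Pod count seen.
func totalRange(steps []step) (lowest, highest int) {
	lowest = steps[0].Old + steps[0].New
	highest = lowest
	for _, s := range steps[1:] {
		total := s.Old + s.New
		if total < lowest {
			lowest = total
		}
		if total > highest {
			highest = total
		}
	}
	return lowest, highest
}

func main() {
	// 3 replicas with the 25% defaults resolve to maxSurge=1, maxUnavailable=0.
	for _, s := range simulateRollout(3, 1, 0) {
		fmt.Printf("old=%d new=%d total=%d\n", s.Old, s.New, s.Old+s.New)
	}
	// 4 replicas with the 25% defaults resolve to maxSurge=1, maxUnavailable=1.
	fmt.Println(totalRange(simulateRollout(4, 1, 1)))
}
```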

You can inspect the Deployment as follows:

$ kubectl describe deployment.apps/nginx
Name:                   nginx
Namespace:              default
CreationTimestamp:      Fri, 26 Dec 2025 16:52:13 +0800
Labels:                 app=nginx
Annotations:            deployment.kubernetes.io/revision: 6
Selector:               app=nginx
Replicas:               3 desired | 3 updated | 3 total | 3 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:  app=nginx
  Containers:
   nginx:
    Image:         nginx:1.16.1
    Port:          <none>
    Host Port:     <none>
    Environment:   <none>
    Mounts:        <none>
  Volumes:         <none>
  Node-Selectors:  <none>
  Tolerations:     <none>
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable
OldReplicaSets:  nginx-bf5d5cf98 (0/0 replicas created), nginx-6bc884fb4d (0/0 replicas created), nginx-5c5f8cd7 (0/0 replicas created), nginx-7584b6f84c (0/0 replicas created)
NewReplicaSet:   nginx-5f75c649df (3/3 replicas created)
Events:

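When computing availableReplicas, Kubernetes does not count Pods that are terminating (status Terminating). As a result:

During a rolling update you may see **a total number of Pods slightly above replicas + maxSurge**.
This happens because old Pods that are already marked for termination remain present until they exit or terminationGracePeriodSeconds (30 seconds by default) elapses.
Actual resource consumption can therefore briefly exceed the expected upper bound until the old Pods are really gone.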

Rolling back a Deployment

You can list all revisions of a Deployment:

$ kubectl rollout history deployment.apps/nginx
deployment.apps/nginx 
REVISION  CHANGE-CAUSE
1         <none>
3         <none>
4         <none>
5         <none>
6         <none>

To inspect a specific revision in detail:

$ kubectl rollout history deployment.apps/nginx --revision 5
deployment.apps/nginx with revision #5
Pod Template:
  Labels:       app=nginx
        pod-template-hash=7584b6f84c
  Containers:
   nginx:
    Image:      nginx:latest
    Port:       <none>
    Host Port:  <none>
    Environment:        <none>
    Mounts:     <none>
  Volumes:      <none>
  Node-Selectors:       <none>
  Tolerations:  <none>

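You can roll back with the following command:

# Without --to-revision, "kubectl rollout undo deployment.apps/nginx" rolls back to the previous revision.
$ kubectl rollout undo deployment.apps/nginx --to-revision=5

As shown below, the Deployment is rolled back to the old Pod template, which can be recognized by its pod-template-hash (7584b6f84c, the hash of revision 5), but the revision number still increases (from 6 to 7 here).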

$ kubectl describe deployment.apps/nginx
Name:                   nginx
Namespace:              default
CreationTimestamp:      Fri, 26 Dec 2025 16:52:13 +0800
Labels:                 app=nginx
Annotations:            deployment.kubernetes.io/revision: 7
Selector:               app=nginx
Replicas:               3 desired | 3 updated | 3 total | 3 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:  app=nginx
  Containers:
   nginx:
    Image:         nginx:latest
    Port:          <none>
    Host Port:     <none>
    Environment:   <none>
    Mounts:        <none>
  Volumes:         <none>
  Node-Selectors:  <none>
  Tolerations:     <none>
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable
OldReplicaSets:  nginx-bf5d5cf98 (0/0 replicas created), nginx-5f75c649df (0/0 replicas created), nginx-6bc884fb4d (0/0 replicas created), nginx-5c5f8cd7 (0/0 replicas created)
NewReplicaSet:   nginx-7584b6f84c (3/3 replicas created)

Scaling a Deployment

You can scale a Deployment with the following command:

kubectl scale deployment/nginx-deployment --replicas=10

The output is similar to:

deployment.apps/nginx-deployment scaled

Assuming Horizontal Pod Autoscaler (HPA) is enabled in your cluster, you can set up an autoscaler for the Deployment and choose the minimum and maximum number of Pods to run, based on the CPU utilization of the existing Pods:

kubectl autoscale deployment/nginx-deployment --min=10 --max=15 --cpu-percent=80

The output is similar to:

horizontalpodautoscaler.autoscaling/nginx-deployment autoscaled

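Proportional scaling

A RollingUpdate Deployment supports running multiple versions of an application at the same time. When you (or an autoscaler) scale a Deployment that is in the middle of a rollout (in progress or paused), the Deployment controller distributes the additional replicas across all currently active ReplicaSets (those that have Pods) in proportion to their sizes, in order to reduce risk. This mechanism is called proportional scaling.

Suppose you have a Deployment configured with:

replicas: 10
maxSurge: 3
maxUnavailable: 2

First confirm that the 10 replicas are running:

kubectl get deploy

The output is similar to:

NAME               DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   10        10        10           10          50s

Now you update the image, but the new image cannot be resolved from inside the cluster (for example, nginx:sometag does not exist):

kubectl set image deployment/nginx-deployment nginx=nginx:sometag

The output is:

deployment.apps/nginx-deployment image updated

This starts a new rollout with a new ReplicaSet, nginx-deployment-1989198191. The rollout is blocked, however, because the new Pods never become ready and maxUnavailable=2 forbids taking more than 2 Pods out of service.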

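Check the ReplicaSets:

kubectl get rs

The output is similar to:

NAME                          DESIRED   CURRENT   READY   AGE
nginx-deployment-1989198191   5         5         0       9s   # new version (stuck)
nginx-deployment-618515232    8         8         8       1m   # old version (healthy)

Note: the controller scaled the old ReplicaSet down to 8 (maxUnavailable=2) and the new one up to 5, which reaches the allowed total of 13 Pods (10 + maxSurge=3). Because the new Pods cannot become ready, the rollout stalls at this point.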

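At this point a new scaling request arrives: the autoscaler raises the Deployment's replica count to 15.

The Deployment controller has to decide where the 5 additional replicas go.

Without proportional scaling, all 5 would be added to the new ReplicaSet (even though it is stuck), which could produce more unavailable Pods.
With proportional scaling, the additional replicas are spread according to the current size of each ReplicaSet:
- larger ReplicaSets receive more of the new replicas;
- smaller ReplicaSets receive fewer;
- any leftover replicas go to the largest ReplicaSet;
- ReplicaSets with 0 replicas are not scaled up.

In this example:

The old ReplicaSet has 8 replicas and the new one has 5.
Total active replicas = 8 + 5 = 13.
The 5 additional replicas are split proportionally:
- old ReplicaSet: 5 × (8/13) ≈ 3.08, rounded to 3
- new ReplicaSet: 5 × (5/13) ≈ 1.92, rounded to 2

So 3 replicas are added to the old ReplicaSet and 2 to the new one.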
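The split above can be reproduced with a small Go sketch. It is a simplified model of the rules listed in this section (round each share to the nearest integer, give any leftover to the largest ReplicaSet, skip empty ReplicaSets), not the Deployment controller's actual implementation; the function name distribute and the ReplicaSet names "old" and "new" are made up for the illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// distribute splits `extra` additional replicas across ReplicaSets in
// proportion to their current sizes and returns how many each one gains.
func distribute(current map[string]int, extra int) map[string]int {
	names := make([]string, 0, len(current))
	total := 0
	for name, n := range current {
		if n > 0 { // ReplicaSets with 0 replicas are not scaled up
			names = append(names, name)
			total += n
		}
	}
	added := make(map[string]int)
	if total == 0 {
		return added
	}
	// Largest ReplicaSet first; the name breaks ties deterministically.
	sort.Slice(names, func(i, j int) bool {
		if current[names[i]] != current[names[j]] {
			return current[names[i]] > current[names[j]]
		}
		return names[i] < names[j]
	})
	remaining := extra
	for _, name := range names {
		// Proportional share, rounded to the nearest integer.
		share := (2*extra*current[name] + total) / (2 * total)
		if share > remaining {
			share = remaining
		}
		added[name] = share
		remaining -= share
	}
	added[names[0]] += remaining // leftover goes to the largest ReplicaSet
	return added
}

func main() {
	added := distribute(map[string]int{"old": 8, "new": 5}, 5)
	fmt.Printf("old +%d, new +%d\n", added["old"], added["new"])
}
```

With the numbers from this example the sketch assigns 3 replicas to the old ReplicaSet and 2 to the new one.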

If the new image is fixed later and its Pods become healthy, the rollout continues and all replicas eventually move to the new ReplicaSet.

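Pausing and resuming a Deployment rollout

When you update a Deployment, or plan to, you can pause its rollout before triggering one or more changes and resume it once you are ready to apply them.

This lets you apply several modifications between pausing and resuming without triggering unnecessary intermediate rollouts.

Pause the rollout:

kubectl rollout pause deployment/nginx-deployment

The output is:

deployment.apps/nginx-deployment paused

Make a change while the rollout is paused, for example update the container image:

kubectl set image deployment/nginx-deployment nginx=nginx:1.16.1

The output is:

deployment.apps/nginx-deployment image updated

Note that no new rollout has started.

Check the rollout history to confirm that no new revision was added:

kubectl rollout history deployment/nginx-deployment

The output is similar to the following. Only revision 1 exists, which shows that the change is staged but has not taken effect.

deployments "nginx"
REVISION  CHANGE-CAUSE
1         <none>

Check the ReplicaSets to confirm that nothing has changed:

kubectl get rs

The output is:

NAME               DESIRED   CURRENT   READY   AGE
nginx-2142116321   3         3         3       2m

This is still the old ReplicaSet; no new one has been created.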

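You can make as many further changes as you like while the rollout is paused, for example update the resource limits:

kubectl set resources deployment/nginx-deployment -c=nginx --limits=cpu=200m,memory=512Mi

The output is:

deployment.apps/nginx-deployment resource requirements updated

✅ Key point:
While the rollout is paused, the Deployment keeps running in its existing state. Changes to the Pod template (.spec.template) do not trigger a new rollout; they accumulate instead.

When you have finished all modifications, resume the rollout:

kubectl rollout resume deployment/nginx-deployment

The output is:

deployment.apps/nginx-deployment resumed

The Deployment now applies all accumulated changes at once and creates a single new ReplicaSet.

⚠️ Important

A paused Deployment cannot be rolled back.
Resume it first, then use kubectl rollout undo.

Typical use cases:

Scenario	Description
Batched fixes	While debugging, a developer can pause the Deployment, change the image, environment variables, resource limits and so on, and apply everything at once, avoiding the churn of several rollouts.
Avoiding intermediate states	Prevents "half-finished" versions between related configuration changes (for example, the image changed but the resource settings not yet).
Controlling release timing	In a CI/CD pipeline, pause first and release in one step once all components are ready.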
ReplicaSet

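The purpose of the ReplicaSet controller is to maintain a stable set of running Pods; it guarantees the availability of a specified number of identical Pods.

Although a ReplicaSet can be used on its own, it is mainly used by Deployments as the mechanism that orchestrates Pod creation, deletion and updates. You do not need to manage the ReplicaSets owned by a Deployment yourself.

The ReplicaSetSpec structure is defined as follows: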

// https://github.com/kubernetes/kubernetes/blob/release-1.34/staging/src/k8s.io/api/apps/v1/types.go

// ReplicaSetSpec is the specification of a ReplicaSet.
type ReplicaSetSpec struct {
	// Replicas is the number of desired pods.
	Replicas *int32 `json:"replicas,omitempty" protobuf:"varint,1,opt,name=replicas"`

	// Minimum number of seconds for which a newly created pod should be ready
	// without any of its container crashing, for it to be considered available.
	// Defaults to 0 (pod will be considered available as soon as it is ready)
	MinReadySeconds int32 `json:"minReadySeconds,omitempty" protobuf:"varint,4,opt,name=minReadySeconds"`

	// Selector is a label query over pods that should match the replica count.
	// Label keys and values that must match in order to be controlled by this replica set.
	// It must match the pod template's labels.
	Selector *metav1.LabelSelector `json:"selector" protobuf:"bytes,2,opt,name=selector"`

	// Template is the object that describes the pod that will be created if
	// insufficient replicas are detected.
	Template v1.PodTemplateSpec `json:"template,omitempty" protobuf:"bytes,3,opt,name=template"`
}

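As you can see, the fields of ReplicaSetSpec are a subset of those of DeploymentSpec. A ReplicaSet is defined by the following fields:

Selector: identifies which Pods the ReplicaSet manages.
Replicas: the number of Pods to maintain.
Pod template: the configuration used when new Pods have to be created.

A ReplicaSet creates or deletes Pods according to the difference between the actual and the desired number of Pods. When it needs new Pods, it uses its Pod template.

A ReplicaSet is linked to its Pods through the Pods' metadata.ownerReferences field, which names the owner of the current object (the Pod). Every Pod managed by a ReplicaSet carries that ReplicaSet's identity in its ownerReferences. Through this link the ReplicaSet tracks the state of its Pods and plans accordingly.

A ReplicaSet uses its selector to identify new Pods that it can acquire. A Pod is acquired immediately by the ReplicaSet if both of the following hold:

The Pod has no ownerReference, or its ownerReference is not a controller.
The Pod's labels match the ReplicaSet's selector.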
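The two acquisition conditions can be written as a small predicate. The sketch below uses trimmed-down, hypothetical stand-ins (ownerRef, pod) instead of the real API types, and it only models the matchLabels form of a selector.

```go
package main

import "fmt"

// ownerRef and pod are hypothetical, simplified stand-ins for the API types.
type ownerRef struct {
	Name       string
	Controller bool
}

type pod struct {
	Labels map[string]string
	Owners []ownerRef
}

// matches reports whether every selector key/value pair is present in labels.
func matches(selector, labels map[string]string) bool {
	for k, want := range selector {
		if got, ok := labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// canAcquire is true when no owner of the Pod is a controller and the Pod's
// labels match the selector.
func canAcquire(selector map[string]string, p pod) bool {
	for _, o := range p.Owners {
		if o.Controller {
			return false
		}
	}
	return matches(selector, p.Labels)
}

func main() {
	selector := map[string]string{"tier": "frontend"}
	bare := pod{Labels: map[string]string{"tier": "frontend"}}
	owned := pod{
		Labels: map[string]string{"tier": "frontend"},
		Owners: []ownerRef{{Name: "other-rs", Controller: true}},
	}
	fmt.Println(canAcquire(selector, bare), canAcquire(selector, owned))
}
```

One practical consequence: a bare Pod whose labels happen to match a ReplicaSet's selector counts toward that ReplicaSet's replica count, so labels on standalone Pods should not overlap with ReplicaSet selectors.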

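A ReplicaSet ensures that a specified number of Pod replicas are running at any time. A Deployment, however, is a higher-level abstraction: it manages ReplicaSets and adds declarative updates for Pods along with many other useful features (rolling updates, rollback, pause/resume and so on).

Therefore, unless you have special requirements (for example, custom update orchestration, or no need for updates at all), using a Deployment is strongly recommended over working with ReplicaSets directly.

In practice this means you will probably never need to create or manage ReplicaSet objects yourself: use a Deployment and define your application in its spec.

The following manifest creates a ReplicaSet: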

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: frontend
  labels:
    app: guestbook
    tier: frontend
spec:
  # modify replicas according to your case
  replicas: 3
  selector:
    matchLabels:
      tier: frontend
  template:
    metadata:
      labels:
        tier: frontend
    spec:
      containers:
      - name: php-redis
        image: us-docker.pkg.dev/google-samples/containers/gke/gb-frontend:v5

StatefulSet

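A StatefulSet is the workload object for managing stateful applications, whereas Deployment and ReplicaSet are designed for stateless ones. A StatefulSet manages a set of Pods and provides guarantees about their ordering and uniqueness.

Like a Deployment, a StatefulSet manages Pods that are based on an identical container spec. Unlike a Deployment, a StatefulSet maintains a sticky identity for each of its Pods. The Pods are created from the same spec but are not interchangeable: each one keeps a persistent identifier across any rescheduling.

If you want to use storage volumes to provide persistence for your workload, a StatefulSet can be part of the solution. Individual Pods in a StatefulSet can still fail and be replaced, but the persistent Pod identifiers make it easy to match the existing volumes to the replacement Pods.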

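When to use StatefulSets

StatefulSets are valuable for applications that require one or more of the following:

Stable, unique network identifiers (for example hostname and DNS records).
Stable, persistent storage (a recreated Pod mounts the same volume again).
Ordered, graceful deployment and scaling (for example pod-0 first, then pod-1, and so on).
Ordered, automated rolling updates (Pods are updated one by one, in order).

Here "stable" means that the identity and storage persist even when a Pod is rescheduled (for example, recreated after a node failure).

If your application needs neither stable identifiers nor ordered deployment, deletion or scaling, use a workload object intended for stateless replicas instead, such as a Deployment or ReplicaSet.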

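Limitations

Storage must be provisioned:
The storage for each Pod must be provided in one of the following ways:
- dynamically, by a PersistentVolume provisioner based on the requested StorageClass;
- or pre-provisioned by a cluster administrator.
Deleting or scaling down does not delete volumes:
When you delete a StatefulSet or reduce its replica count, the associated PersistentVolumes (PVs) are not deleted automatically. This protects your data, which is usually more valuable than an automatic purge of all related resources.
A headless Service is required:
A StatefulSet requires a headless Service (a Service with clusterIP: None) that is responsible for the network identity of the Pods (such as pod-name.service-name.namespace.svc.cluster.local).
Kubernetes does not create this Service for you; you have to define it yourself.
Deleting a StatefulSet does not guarantee ordered termination:
Deleting the StatefulSet directly does not terminate its Pods gracefully in order.
For ordered, graceful termination, scale the StatefulSet down to 0 first (kubectl scale statefulset <name> --replicas=0) and delete it afterwards.
Rolling updates can get into a broken state:
With the default Pod management policy (OrderedReady), a Pod that fails to update blocks the whole rollout. Kubernetes does not repair this automatically, so manual intervention (such as deleting the stuck Pod) may be needed.

A simple StatefulSet example: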

apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: nginx # has to match .spec.template.metadata.labels
  serviceName: "nginx"
  replicas: 3 # by default is 1
  minReadySeconds: 10 # by default is 0
  template:
    metadata:
      labels:
        app: nginx # has to match .spec.selector.matchLabels
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: registry.k8s.io/nginx-slim:0.24
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
      volumes:
      - name: www
        emptyDir: {}
  volumeClaimTemplates: []

`StatefulSetSpec` in detail

Here is the definition of StatefulSetSpec:

// https://github.com/kubernetes/kubernetes/blob/release-1.34/staging/src/k8s.io/api/apps/v1/types.go

// A StatefulSetSpec is the specification of a StatefulSet.
type StatefulSetSpec struct {
	// replicas is the desired number of replicas of the given Template.
	Replicas *int32 `json:"replicas,omitempty" protobuf:"varint,1,opt,name=replicas"`

	// selector is a label query over pods that should match the replica count.
	// It must match the pod template's labels.
	Selector *metav1.LabelSelector `json:"selector" protobuf:"bytes,2,opt,name=selector"`

	Template v1.PodTemplateSpec `json:"template" protobuf:"bytes,3,opt,name=template"`

	// volumeClaimTemplates is a list of claims that pods are allowed to reference.
	VolumeClaimTemplates []v1.PersistentVolumeClaim `json:"volumeClaimTemplates,omitempty" protobuf:"bytes,4,rep,name=volumeClaimTemplates"`

	// serviceName is the name of the service that governs this StatefulSet.
	ServiceName string `json:"serviceName" protobuf:"bytes,5,opt,name=serviceName"`

	// podManagementPolicy controls how pods are created during initial scale up,
	// when replacing pods on nodes, or when scaling down. The default policy is
	// `OrderedReady`, where pods are created in increasing order (pod-0, then
	// pod-1, etc) and the controller will wait until each pod is ready before
	// continuing. When scaling down, the pods are removed in the opposite order.
	// The alternative policy is `Parallel` which will create pods in parallel
	// to match the desired scale without waiting, and on scale down will delete
	// all pods at once.
	PodManagementPolicy PodManagementPolicyType `json:"podManagementPolicy,omitempty" protobuf:"bytes,6,opt,name=podManagementPolicy,casttype=PodManagementPolicyType"`

	// updateStrategy indicates the StatefulSetUpdateStrategy that will be
	// employed to update Pods in the StatefulSet when a revision is made to
	// Template.
	UpdateStrategy StatefulSetUpdateStrategy `json:"updateStrategy,omitempty" protobuf:"bytes,7,opt,name=updateStrategy"`

	RevisionHistoryLimit *int32 `json:"revisionHistoryLimit,omitempty" protobuf:"varint,8,opt,name=revisionHistoryLimit"`

	// Minimum number of seconds for which a newly created pod should be ready
	// without any of its container crashing for it to be considered available.
	// Defaults to 0 (pod will be considered available as soon as it is ready)
	MinReadySeconds int32 `json:"minReadySeconds,omitempty" protobuf:"varint,9,opt,name=minReadySeconds"`

	// persistentVolumeClaimRetentionPolicy describes the lifecycle of persistent
	// volume claims created from volumeClaimTemplates. By default, all persistent
	// volume claims are created as needed and retained until manually deleted. 
	PersistentVolumeClaimRetentionPolicy *StatefulSetPersistentVolumeClaimRetentionPolicy `json:"persistentVolumeClaimRetentionPolicy,omitempty" protobuf:"bytes,10,opt,name=persistentVolumeClaimRetentionPolicy"`

	// ordinals controls the numbering of replica indices in a StatefulSet. The
	// default ordinals behavior assigns a "0" index to the first replica and
	// increments the index by one for each additional replica requested.
	Ordinals *StatefulSetOrdinals `json:"ordinals,omitempty" protobuf:"bytes,11,opt,name=ordinals"`
}

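Some of the basic fields have the same meaning as in DeploymentSpec, for example Replicas, Selector, Template (a PodTemplateSpec) and MinReadySeconds, so they are not covered again here.

VolumeClaimTemplates

As mentioned earlier, a StatefulSet relies on PersistentVolumes (PVs) for stable, persistent storage. You can set the .spec.volumeClaimTemplates field to have PersistentVolumeClaims (PVCs) created for its Pods. This gives the StatefulSet stable storage when either of the following holds:

The StorageClass specified for the volume claim is set up for dynamic provisioning.
The cluster already contains a PersistentVolume with the correct StorageClass and sufficient available storage space.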

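Ordinals

For a StatefulSet with N replicas, each Pod is assigned a unique integer ordinal, by default in the range 0 to N-1.

The StatefulSet controller also adds a label to each Pod, apps.kubernetes.io/pod-index, whose value is the Pod's ordinal.

.spec.ordinals is an optional field that lets you customize the ordinals assigned to Pods. It defaults to nil, which means ordinals start at 0. If .spec.ordinals.start is set, Pods receive the ordinals **start through start + replicas - 1**.

Feature state: Kubernetes v1.31 [stable] (enabled by default)

For example, with replicas=3 and ordinals.start=100 the Pod ordinals are 100, 101 and 102.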
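As a quick illustration of the numbering rule, the following Go sketch lists the Pod names a StatefulSet would use for a given start ordinal. The helper name podNames is hypothetical; the name format (StatefulSet name, a dash, then the ordinal) is the one used throughout this article.

```go
package main

import "fmt"

// podNames returns the Pod names for a StatefulSet called `set` whose
// ordinals run from start to start+replicas-1.
func podNames(set string, start, replicas int) []string {
	names := make([]string, 0, replicas)
	for i := 0; i < replicas; i++ {
		names = append(names, fmt.Sprintf("%s-%d", set, start+i))
	}
	return names
}

func main() {
	fmt.Println(podNames("web", 0, 3))   // default numbering
	fmt.Println(podNames("web", 100, 3)) // .spec.ordinals.start = 100
}
```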

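PodManagementPolicy

Before looking at the Pod management policies, here are the deployment and scaling guarantees that Kubernetes provides for a StatefulSet by default.

For a StatefulSet with N replicas, Pods are created and deleted in a strict order:

On deployment, Pods are created sequentially from 0 to N-1, for example web-0 → web-1 → web-2.
On deletion, Pods are terminated in reverse order from N-1 to 0, for example web-2 → web-1 → web-0.
Before a Pod is created, all of its predecessors (Pods with lower ordinals) must be Ready. If .spec.minReadySeconds is set, the predecessors must have been ready and available for at least minReadySeconds seconds.
Before a Pod is terminated, all of its successors (Pods with higher ordinals) must be completely shut down.

A StatefulSet should not set terminationGracePeriodSeconds of its Pods to 0. This practice is unsafe and strongly discouraged.

Example (using the nginx StatefulSet): when a StatefulSet with replicas=3 is created:

Pods are deployed in the order web-0 → web-1 → web-2.
web-1 is not started before web-0 is Running and Ready.
web-2 is not started before web-1 is Running and Ready.

Failure scenario:

If web-0 fails after web-1 has become ready but before web-2 is launched, web-2 is not launched until web-0 has been recreated and is Running and Ready again.

Scale-down scenario: if a user reduces replicas from 3 to 1:

web-2 is terminated first.
web-1 is terminated only after web-2 has been fully shut down and deleted.
If web-0 fails after web-2 has been deleted but before web-1 is terminated, the termination of web-1 is put on hold until web-0 is Running and Ready again.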
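The creation-order rule can be modelled with a short Go sketch. This is a deliberately simplified illustration of the guarantees listed above, not the StatefulSet controller's code; the podState type and the nextToCreate function are hypothetical names.

```go
package main

import "fmt"

// podState is a hypothetical three-valued view of a Pod slot.
type podState int

const (
	missing  podState = iota // the Pod does not exist
	notReady                 // the Pod exists but is not Running and Ready
	ready                    // the Pod is Running and Ready
)

// nextToCreate returns the ordinal of the Pod that may be created next under
// the OrderedReady policy, or -1 when nothing can be created right now
// (either every Pod exists, or a predecessor is not ready yet).
// states[i] describes the Pod with ordinal i.
func nextToCreate(states []podState) int {
	for ordinal, s := range states {
		switch s {
		case missing:
			return ordinal // all lower ordinals were ready, so create this one
		case notReady:
			return -1 // blocked until this predecessor becomes ready
		}
	}
	return -1
}

func main() {
	// Normal rollout: web-0 and web-1 are ready, web-2 comes next.
	fmt.Println(nextToCreate([]podState{ready, ready, missing}))
	// Failure scenario: web-0 is unhealthy, so web-2 has to wait.
	fmt.Println(nextToCreate([]podState{notReady, ready, missing}))
}
```

In the failure scenario the function reports that nothing may be created, which matches the behavior described above: web-2 waits until web-0 is Running and Ready again.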

Now let's look at the definition of PodManagementPolicyType:

// https://github.com/kubernetes/kubernetes/blob/release-1.34/staging/src/k8s.io/api/apps/v1/types.go

type PodManagementPolicyType string
const (
	// OrderedReadyPodManagement will create pods in strictly increasing order on
	// scale up and strictly decreasing order on scale down, progressing only when
	// the previous pod is ready or terminated. At most one pod will be changed
	// at any time.
	OrderedReadyPodManagement PodManagementPolicyType = "OrderedReady"
	// ParallelPodManagement will create and delete pods as soon as the stateful set
	// replica count is changed, and will not wait for pods to be ready or complete
	// termination.
	ParallelPodManagement PodManagementPolicyType = "Parallel"
)

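"OrderedReady": the default policy, which gives StatefulSets the strict ordering rules described above.

"Parallel": tells the StatefulSet controller to launch or terminate all Pods in parallel. When launching or terminating a Pod it does not wait for other Pods to become Running and Ready, nor for other Pods to terminate completely.

For scaling operations, this means all Pods are created or terminated at the same time.
During a rolling update, if .spec.updateStrategy.rollingUpdate.maxUnavailable is greater than 1, the StatefulSet controller terminates and creates up to maxUnavailable Pods at the same time (also known as "bursting"). This speeds up updates, but Pods may become ready out of order, so it is not suitable for applications with strict requirements on the order in which Pods start or update.

UpdateStrategy

Here is how StatefulSets define the update strategy: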

// https://github.com/kubernetes/kubernetes/blob/release-1.34/staging/src/k8s.io/api/apps/v1/types.go

type StatefulSetUpdateStrategy struct {
	// Type indicates the type of the StatefulSetUpdateStrategy.
	// Default is RollingUpdate.
	Type StatefulSetUpdateStrategyType `json:"type,omitempty" protobuf:"bytes,1,opt,name=type,casttype=StatefulSetStrategyType"`
  
	// RollingUpdate is used to communicate parameters when Type is RollingUpdateStatefulSetStrategyType.
	RollingUpdate *RollingUpdateStatefulSetStrategy `json:"rollingUpdate,omitempty" protobuf:"bytes,2,opt,name=rollingUpdate"`
}


type StatefulSetUpdateStrategyType string

const (
	RollingUpdateStatefulSetStrategyType StatefulSetUpdateStrategyType = "RollingUpdate"
	OnDeleteStatefulSetStrategyType StatefulSetUpdateStrategyType = "OnDelete"
)

// RollingUpdateStatefulSetStrategy is used to communicate parameter for RollingUpdateStatefulSetStrategyType.
type RollingUpdateStatefulSetStrategy struct {
	// Partition indicates the ordinal at which the StatefulSet should be partitioned
	// for updates. During a rolling update, all pods from ordinal Replicas-1 to
	// Partition are updated. All pods from ordinal Partition-1 to 0 remain untouched.
	// This is helpful in being able to do a canary based deployment. The default value is 0.
	Partition *int32 `json:"partition,omitempty" protobuf:"varint,1,opt,name=partition"`
	// The maximum number of pods that can be unavailable during the update.
	// Value can be an absolute number (ex: 5) or a percentage of desired pods (ex: 10%).
	// Absolute number is calculated from percentage by rounding up. This can not be 0.
	// Defaults to 1. This field is alpha-level and is only honored by servers that enable the
	// MaxUnavailableStatefulSet feature. The field applies to all pods in the range 0 to
	// Replicas-1. That means if there is any unavailable pod in the range 0 to Replicas-1, it will be counted towards MaxUnavailable.
	MaxUnavailable *intstr.IntOrString `json:"maxUnavailable,omitempty" protobuf:"varint,2,opt,name=maxUnavailable"`
}

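The field supports two strategies:

OnDelete: when .spec.updateStrategy.type is set to OnDelete, the StatefulSet controller does not update Pods automatically. You must delete the old Pods manually before the controller creates new ones from the updated .spec.template. This suits cases that need full manual control over the update pace.
RollingUpdate: the default update strategy for StatefulSets. It performs an automated rolling update, deleting and recreating Pods one at a time to apply the new template.

With the RollingUpdate strategy:

The StatefulSet controller deletes and recreates each Pod in reverse ordinal order (from the largest ordinal to the smallest).
Before updating the next Pod (the one with the next lower ordinal), the Kubernetes control plane waits until the Pod just updated is Running and Ready.
If .spec.minReadySeconds is set (see "minimum ready seconds"), the control plane additionally waits that many seconds after the Pod becomes ready before moving on to the preceding Pod.

The RollingUpdate strategy can be tuned further. As shown above, RollingUpdateStatefulSetStrategy has two fields, Partition and MaxUnavailable; Partition enables partitioned rolling updates.

You can partition a rolling update by setting .spec.updateStrategy.rollingUpdate.partition.

When .spec.template is updated:
- Pods with an ordinal ≥ partition are updated;
- Pods with an ordinal < partition are not updated;
- even if such lower-ordinal Pods are deleted manually, they are recreated from the previous version of the template.
Special case:
If partition is greater than .spec.replicas, no Pod is updated at all (no Pod has an ordinal ≥ partition).
Typical uses:
- phased rollouts;
- canary releases: update a high-ordinal Pod (such as web-2) first and verify it;
- gray-scale testing: update only part of the instances.

💡 In most cases you do not need a partition, but it is very useful when you need fine-grained control over the scope of an update.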
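The effect of a partition on the update order can be sketched as follows. This is an illustration of the rules above under the assumption that partition is non-negative; the function name updateOrder is hypothetical.

```go
package main

import "fmt"

// updateOrder returns the ordinals a partitioned rolling update touches,
// in the order they are updated (highest ordinal first). Pods with an
// ordinal below partition stay on the previous template.
func updateOrder(replicas, partition int) []int {
	order := []int{}
	for ordinal := replicas - 1; ordinal >= partition && ordinal >= 0; ordinal-- {
		order = append(order, ordinal)
	}
	return order
}

func main() {
	fmt.Println("partition=0:", updateOrder(3, 0)) // full rollout
	fmt.Println("partition=2:", updateOrder(3, 2)) // canary: only web-2
	fmt.Println("partition=5:", updateOrder(3, 5)) // nothing is updated
}
```

A canary release then amounts to starting with a high partition value and lowering it step by step once the already updated Pods have been verified.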

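MaxUnavailable has the same meaning as in a Deployment: the maximum number of Pods that may be unavailable.

Feature state: Kubernetes v1.35 [beta] (enabled by default)

The .spec.updateStrategy.rollingUpdate.maxUnavailable field controls the maximum number of Pods that can be unavailable during an update. The value can be:

an absolute number (for example 5);
a percentage (for example 10%), which is converted to an absolute number by rounding up.

The value cannot be 0; **the default is 1**.

📌 The limit applies to all Pods in the ordinal range [0, replicas - 1]. Any unavailable Pod in that range (for example NotReady, Pending or Terminating) counts toward maxUnavailable.

With the default Pod management policy (OrderedReady), a rolling update can get into a broken state that requires manual intervention:

Scenario:
If you update the Pod template to a configuration that never becomes Running and Ready (for example a wrong image or a broken application configuration), the StatefulSet stops the rollout and waits.
Problem:
Reverting the template to a good configuration is not enough. Due to a known issue, the controller keeps waiting for the broken Pod to become Ready (which never happens) and does not try to recreate it from the reverted template.
Fix:
After reverting the template, you must also delete every Pod that was created with the bad configuration. The StatefulSet then recreates those Pods from the reverted template.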

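PersistentVolumeClaimRetentionPolicy

PersistentVolumeClaim (PVC) retention policy:

Feature state: Kubernetes v1.32 [stable] (enabled by default)

The optional .spec.persistentVolumeClaimRetentionPolicy field controls whether and how the PVCs created from the StatefulSet's volumeClaimTemplates are deleted during the StatefulSet's lifecycle.

⚠️ Prerequisite:
The StatefulSetAutoDeletePVC feature gate must be enabled on the API server and the controller manager for this field to take effect.
(Because the feature is stable as of v1.32 and enabled by default, you normally do not need to turn it on yourself.)

The types are defined as follows: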

// https://github.com/kubernetes/kubernetes/blob/release-1.34/staging/src/k8s.io/api/apps/v1/types.go

type PersistentVolumeClaimRetentionPolicyType string

const (
	RetainPersistentVolumeClaimRetentionPolicyType PersistentVolumeClaimRetentionPolicyType = "Retain"

	DeletePersistentVolumeClaimRetentionPolicyType PersistentVolumeClaimRetentionPolicyType = "Delete"
)

// StatefulSetPersistentVolumeClaimRetentionPolicy describes the policy used for PVCs
// created from the StatefulSet VolumeClaimTemplates.
type StatefulSetPersistentVolumeClaimRetentionPolicy struct {
	WhenDeleted PersistentVolumeClaimRetentionPolicyType `json:"whenDeleted,omitempty" protobuf:"bytes,1,opt,name=whenDeleted,casttype=PersistentVolumeClaimRetentionPolicyType"`

	WhenScaled PersistentVolumeClaimRetentionPolicyType `json:"whenScaled,omitempty" protobuf:"bytes,2,opt,name=whenScaled,casttype=PersistentVolumeClaimRetentionPolicyType"`
}

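With the feature enabled, two policies can be configured for each StatefulSet:

whenDeleted: controls what happens to the PVCs when the StatefulSet is deleted.
whenScaled: controls what happens to the PVCs when the replica count of the StatefulSet is reduced (scale-down).

Each policy takes one of two values:

Delete: the PVCs of the Pods affected by the policy are deleted.
- For whenDeleted: when the StatefulSet is deleted, all PVCs created from its volumeClaimTemplates are deleted after their Pods have been deleted.
- For whenScaled: only the PVCs of the Pods removed by the scale-down are deleted, after those Pods have been deleted.
Retain (default): PVCs are not deleted automatically, even when their Pods are deleted. This is the behavior Kubernetes had before the feature was introduced, and it keeps data safe.

🔔 Note:
The policies apply only when Pods are removed because the StatefulSet is deleted or scaled down.

For example, if a Pod is recreated after a node failure, the StatefulSet keeps the existing PVC and the existing volume is attached to the node of the new Pod; the PVC and the underlying storage are not affected.

The default policy is Retain, for backward compatibility.

Example configuration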

apiVersion: apps/v1
kind: StatefulSet
...
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain   # keep the PVCs when the StatefulSet is deleted
    whenScaled: Delete    # delete the corresponding PVCs on scale-down
...

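How it works

The StatefulSet controller adds owner references to the PVCs.
After a Pod has terminated, the garbage collector deletes the PVCs it owns, which ensures that:
- the Pod can cleanly unmount all volumes first;
- only then are the PVCs deleted (and the underlying storage is kept or removed according to the reclaim policy of the PV).

About whenDeleted: Delete

The controller sets the StatefulSet instance itself as the owner of all associated PVCs.
When the StatefulSet is deleted, all these PVCs are deleted in cascade as its dependents.

About whenScaled: Delete

Only when a Pod is condemned because of a scale-down (its ordinal is ≥ the new replica count) does the controller:
1. set that Pod as the owner of its PVCs;
2. delete the Pod;
3. let the garbage collector delete the PVCs the Pod owned once it has terminated.

✅ This ensures that only Pods removed by a scale-down have their PVCs cleaned up; Pods recreated after a failure are not affected.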
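The decision described in this section can be summarized in a small Go sketch. It is a model for reading the policy, not controller code: the policy struct mirrors the two fields of persistentVolumeClaimRetentionPolicy, while the reason constants and the pvcDeleted function are hypothetical names introduced for the illustration.

```go
package main

import "fmt"

// policy mirrors the two fields of persistentVolumeClaimRetentionPolicy.
// An empty value behaves like the default, "Retain".
type policy struct{ WhenDeleted, WhenScaled string }

// Hypothetical labels for the reason a Pod is being removed.
const (
	setDeleted  = "statefulset-deleted"
	scaledDown  = "scaled-down"
	podReplaced = "pod-replaced" // for example, recreated after a node failure
)

// pvcDeleted reports whether the PVC of a removed Pod is deleted as well.
func pvcDeleted(p policy, reason string) bool {
	switch reason {
	case setDeleted:
		return p.WhenDeleted == "Delete"
	case scaledDown:
		return p.WhenScaled == "Delete"
	default:
		return false // the policies do not apply to any other removal
	}
}

func main() {
	p := policy{WhenDeleted: "Retain", WhenScaled: "Delete"}
	for _, reason := range []string{setDeleted, scaledDown, podReplaced} {
		fmt.Printf("%s: PVC deleted = %v\n", reason, pvcDeleted(p, reason))
	}
}
```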

Pod Identity

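Every Pod in a StatefulSet has a unique identity that consists of three parts:

Ordinal
Stable network identity
Stable storage

The identity sticks to the Pod, regardless of which node it is scheduled or rescheduled onto.

Through the combination of ordinal, fixed network name and dedicated persistent volume, a StatefulSet gives stateful applications a predictable, rebuildable and identifiable runtime environment. Its core value is:

Stable identity: a recreated Pod keeps its name, its storage and its network identity.
Ordering guarantees: startup, update and termination all happen in order.
Operational convenience: standard labels (such as pod-index) support fine-grained management.

💡 Best practice: always pair a StatefulSet with a headless Service, and manage the PV lifecycle carefully to keep data safe.

Ordinals were covered above in the StatefulSetSpec section, so the following looks at the other two parts.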

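Stable network ID

The hostname of each StatefulSet Pod is derived from the name of the StatefulSet and the ordinal of the Pod, in the form:

$(statefulset-name)-$(ordinal)

For example, a StatefulSet named web with 3 replicas creates the Pods web-0, web-1 and web-2.

A StatefulSet must be used together with a headless Service (clusterIP: None) that controls the network domain of its Pods. The domain managed by this Service has the form:

$(service-name).$(namespace).svc.$(cluster-domain)

where cluster-domain defaults to cluster.local.

Each Pod gets a matching DNS subdomain of the form:

$(pod-name).$(service-name).$(namespace).svc.$(cluster-domain)

🔔 Note:
Because of negative caching in DNS, other clients may not be able to resolve the DNS name of a Pod immediately after it is created.

Even after the Pod is running, the result of an earlier failed lookup may stay cached for some seconds (CoreDNS caches for 30 seconds by default).

If you need to discover new Pods quickly, consider:

querying the Kubernetes API directly (for example with a watch) instead of relying on DNS;
reducing the DNS cache time (for example by changing the cache settings in the CoreDNS ConfigMap).

⚠️ Once more: you have to create the headless Service yourself; Kubernetes does not create it automatically.

The table below shows some example choices of cluster domain, Service name and StatefulSet name, and how they affect the DNS names of the StatefulSet's Pods.

Cluster Domain	Service (ns/name)	StatefulSet (ns/name)	StatefulSet Domain	Pod DNS	Pod Hostname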
`cluster.local`	`default/nginx`	`default/web`	`nginx.default.svc.cluster.local`	`web-{0..N-1}.nginx.default.svc.cluster.local`	`web-{0..N-1}`
`cluster.local`	`foo/nginx`	`foo/web`	`nginx.foo.svc.cluster.local`	`web-{0..N-1}.nginx.foo.svc.cluster.local`	`web-{0..N-1}`
`kube.local`	`foo/nginx`	`foo/web`	`nginx.foo.svc.kube.local`	`web-{0..N-1}.nginx.foo.svc.kube.local`	`web-{0..N-1}`

📌 The cluster domain is cluster.local unless configured otherwise.
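The naming scheme in the table can be reproduced with a few lines of Go. The helper podDNSNames is a hypothetical name; it simply fills in the pattern $(pod-name).$(service-name).$(namespace).svc.$(cluster-domain) and assumes the default numbering from 0.

```go
package main

import "fmt"

// podDNSNames builds the per-Pod DNS names for a StatefulSet `set` governed
// by the headless Service `service` in `namespace`.
func podDNSNames(set, service, namespace, clusterDomain string, replicas int) []string {
	names := make([]string, 0, replicas)
	for ordinal := 0; ordinal < replicas; ordinal++ {
		names = append(names, fmt.Sprintf("%s-%d.%s.%s.svc.%s",
			set, ordinal, service, namespace, clusterDomain))
	}
	return names
}

func main() {
	for _, name := range podDNSNames("web", "nginx", "default", "cluster.local", 3) {
		fmt.Println(name)
	}
}
```

Run as written, this prints the three names from the first row of the table, starting with web-0.nginx.default.svc.cluster.local.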

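Stable storage

For every entry in volumeClaimTemplates of a StatefulSet, each Pod automatically receives one PersistentVolumeClaim (PVC).

Take an nginx StatefulSet whose volumeClaimTemplates entry requests 1 GiB from a StorageClass such as my-storage-class (the manifest shown earlier uses an emptyDir volume and an empty volumeClaimTemplates list, so it creates no PVCs):

Each Pod is bound to its own PersistentVolume (PV).
The PV uses the specified StorageClass (for example my-storage-class).
The PV provides 1 GiB of storage.
If no StorageClass is specified, the cluster's default StorageClass is used.

When a Pod is scheduled or rescheduled onto a node, its volumeMounts mount the PVs associated with its PVCs.

⚠️ Important:
When a Pod or the StatefulSet is deleted, the associated PersistentVolumes are not deleted automatically. This prevents accidental data loss. To clean up, you have to delete the PVCs and PVs manually.

Pod name label

When the StatefulSet controller creates a Pod, it adds the label statefulset.kubernetes.io/pod-name, whose value is the full name of the Pod (for example web-0).

This label lets you attach a Service to one specific Pod (for example for debugging or special routing).

Pod index label

Feature state: Kubernetes v1.32 [stable] (enabled by default)

The StatefulSet controller also adds the label apps.kubernetes.io/pod-index to each Pod, whose value is the Pod's ordinal (an integer).

This label can be used to:

route traffic to the Pod with a particular ordinal;
filter logs or metrics by ordinal;
identify Pod order in monitoring or automation scripts.

🔒 This behavior is controlled by the PodIndexLabel feature gate, which is enabled by default and locked. To disable it, the server has to run with an emulated version of v1.31 (generally not recommended).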

DaemonSet

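A DaemonSet defines Pods that provide node-local facilities. These may be fundamental to the operation of the cluster (for example a networking helper tool) or be part of an add-on.

The DaemonSet controller is designed to run one copy of a Pod on all nodes, or on a subset of them. When a node joins the cluster, a Pod is added to it automatically. When a node is removed from the cluster, its Pod is garbage collected. Deleting a DaemonSet deletes all the Pods it created.

Typical uses of a DaemonSet:

running a cluster daemon on every node
running a log collection daemon on every node
running a monitoring daemon on every node

In a simple case, one DaemonSet covering all nodes is used for each type of daemon. In more complex setups, several DaemonSets may be used for a single type of daemon, with different flags or with different CPU and memory requests for different hardware types.

Next is the definition of DaemonSetSpec. It closely resembles the other workload resources; for the basic fields, see the description of DeploymentSpec.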

// https://github.com/kubernetes/kubernetes/blob/release-1.34/staging/src/k8s.io/api/apps/v1/types.go

type DaemonSetSpec struct {
	// A label query over pods that are managed by the daemon set.
	// Must match in order to be controlled.
	// It must match the pod template's labels.
	Selector *metav1.LabelSelector `json:"selector" protobuf:"bytes,1,opt,name=selector"`

	// An object that describes the pod that will be created.
	Template v1.PodTemplateSpec `json:"template" protobuf:"bytes,2,opt,name=template"`

	// An update strategy to replace existing DaemonSet pods with new pods.
	UpdateStrategy DaemonSetUpdateStrategy `json:"updateStrategy,omitempty" protobuf:"bytes,3,opt,name=updateStrategy"`

	// The minimum number of seconds for which a newly created DaemonSet pod should
	// be ready without any of its container crashing, for it to be considered
	// available. Defaults to 0 (pod will be considered available as soon as it
	// is ready).
	MinReadySeconds int32 `json:"minReadySeconds,omitempty" protobuf:"varint,4,opt,name=minReadySeconds"`

	// The number of old history to retain to allow rollback.
	// This is a pointer to distinguish between explicit zero and not specified.
	// Defaults to 10.
	RevisionHistoryLimit *int32 `json:"revisionHistoryLimit,omitempty" protobuf:"varint,6,opt,name=revisionHistoryLimit"`
}

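How daemon Pods are scheduled

If you specify .spec.template.spec.nodeSelector in a DaemonSet, the DaemonSet controller creates Pods on nodes that match the node selector. Likewise, if you specify .spec.template.spec.affinity, the controller creates Pods on nodes that match the node affinity. If you specify neither, the controller creates Pods on all nodes.

A DaemonSet can be used to ensure that all eligible nodes run a copy of a Pod. The DaemonSet controller creates a Pod for each eligible node and sets the Pod's spec.affinity.nodeAffinity field to match the target host.

After the Pod is created, the default scheduler usually takes over and binds the Pod to the target node by setting its .spec.nodeName field. If the new Pod does not fit on the node (for example because of insufficient CPU or memory), the default scheduler may preempt (evict) some of the existing Pods, depending on the priority of the new Pod.

Note:

If it is important that the DaemonSet Pod runs on every node, it is usually advisable to set .spec.template.spec.priorityClassName of the DaemonSet to a PriorityClass with a high priority, so that lower-priority Pods can be evicted when necessary.

You can choose a different scheduler for the Pods of a DaemonSet by setting its .spec.template.spec.schedulerName field.

In addition, when the DaemonSet controller evaluates which nodes are eligible, it takes the node affinity originally specified in .spec.template.spec.affinity.nodeAffinity (if any) into account. On the created Pod, however, that affinity is replaced by a node affinity rule that matches the name of the target node exactly.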

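Taints and tolerations

The DaemonSet controller automatically adds a set of tolerations to the Pods of a DaemonSet:

Toleration key	Effect	Details
`node.kubernetes.io/not-ready`	`NoExecute`	DaemonSet Pods can be scheduled onto nodes that are not ready or not healthy; DaemonSet Pods already running on a node that becomes not-ready are not evicted.
`node.kubernetes.io/unreachable`	`NoExecute`	DaemonSet Pods can be scheduled onto nodes that are unreachable from the node controller; DaemonSet Pods running on such nodes are not evicted.
`node.kubernetes.io/disk-pressure`	`NoSchedule`	DaemonSet Pods can be scheduled onto nodes with disk pressure (low disk space).
`node.kubernetes.io/memory-pressure`	`NoSchedule`	DaemonSet Pods can be scheduled onto nodes with memory pressure.
`node.kubernetes.io/pid-pressure`	`NoSchedule`	DaemonSet Pods can be scheduled onto nodes that are running out of process IDs (too many processes).
`node.kubernetes.io/unschedulable`	`NoSchedule`	DaemonSet Pods can be scheduled onto nodes that are marked unschedulable.
`node.kubernetes.io/network-unavailable`	`NoSchedule`	Added only for DaemonSet Pods that use host networking (`spec.hostNetwork: true`). Such Pods can be scheduled onto nodes whose network is unavailable.

You can add further tolerations of your own in the Pod template of the DaemonSet.

Because the DaemonSet controller automatically adds the node.kubernetes.io/unschedulable:NoSchedule toleration, Kubernetes can run DaemonSet Pods on nodes that are marked unschedulable.

If you use a DaemonSet to provide an important node-level function (such as the cluster network plugin), it matters that Kubernetes places DaemonSet Pods on a node before the node is ready. Without these special tolerations you could end up in a deadlock: the node cannot become ready because the network plugin is not running, and the network plugin cannot be scheduled onto the node because the node is not ready (it is marked unschedulable or not-ready).

DaemonSets are typically used for critical infrastructure components (CNI network plugins, log collectors, monitoring agents and so on). These components have to run as soon as a node joins the cluster, even while the node is still unhealthy or not ready.

Ordinary Pods are kept away by the taints on a node (for example, a freshly started node may carry node.kubernetes.io/not-ready: NoSchedule).
DaemonSet Pods carry the tolerations listed in the table above and can ignore the corresponding taints, so critical system components can start first and help the node become ready.

The automatic tolerations of DaemonSets are a key part of the bootstrap and high-availability design of Kubernetes: infrastructure components can run before ordinary applications and keep running while a node is in an abnormal state, which avoids system-level failures.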
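To make the effect of the automatic tolerations concrete, here is a simplified Go model. The taint and toleration types are hypothetical stand-ins that only carry a key and an effect (real tolerations also have an operator, a value and an optional duration), and toleratesAll is not a Kubernetes function. The sketch checks a cordoned node, which carries the node.kubernetes.io/unschedulable:NoSchedule taint.

```go
package main

import "fmt"

// taint and toleration are simplified stand-ins for the real API types;
// only the key and the effect are modelled.
type taint struct{ Key, Effect string }
type toleration struct{ Key, Effect string }

// daemonSetTolerations lists the automatic tolerations from the table above
// (the host-network-only network-unavailable entry is left out).
var daemonSetTolerations = []toleration{
	{"node.kubernetes.io/not-ready", "NoExecute"},
	{"node.kubernetes.io/unreachable", "NoExecute"},
	{"node.kubernetes.io/disk-pressure", "NoSchedule"},
	{"node.kubernetes.io/memory-pressure", "NoSchedule"},
	{"node.kubernetes.io/pid-pressure", "NoSchedule"},
	{"node.kubernetes.io/unschedulable", "NoSchedule"},
}

// toleratesAll reports whether every taint is matched by a toleration with
// the same key and effect.
func toleratesAll(tolerations []toleration, taints []taint) bool {
	for _, t := range taints {
		matched := false
		for _, tol := range tolerations {
			if tol.Key == t.Key && tol.Effect == t.Effect {
				matched = true
				break
			}
		}
		if !matched {
			return false
		}
	}
	return true
}

func main() {
	cordoned := []taint{{"node.kubernetes.io/unschedulable", "NoSchedule"}}
	fmt.Println("ordinary Pod fits:", toleratesAll(nil, cordoned))
	fmt.Println("DaemonSet Pod fits:", toleratesAll(daemonSetTolerations, cordoned))
}
```

An ordinary Pod without tolerations is rejected by the cordoned node, while a Pod carrying the DaemonSet tolerations is accepted.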

Jobs

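A Job represents a one-off task that runs to completion and then stops.

A Job creates one or more Pods and keeps retrying their execution until a specified number of them terminate successfully. As Pods complete successfully, the Job tracks the number of successful completions. Once the specified number is reached, the task (the Job) is complete. Deleting a Job also cleans up the Pods it created. Suspending a Job deletes its active Pods until the Job is resumed.

You create a Job object to run a Pod to completion in a reliable way. If the first Pod fails or is deleted (for example because of a node hardware failure or a reboot), the Job object starts a new Pod.

Job and ReplicationController complement each other. **A ReplicationController manages Pods that are not expected to terminate (such as web servers); a Job manages Pods that are expected to terminate (such as batch tasks)**.

`JobSpec` in detail

Here is the definition of JobSpec: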

// https://github.com/kubernetes/kubernetes/blob/release-1.34/staging/src/k8s.io/api/batch/v1/types.go

// JobSpec describes how the job execution will look like.
type JobSpec struct {
	// Specifies the maximum desired number of pods the job should
	// run at any given time. The actual number of pods running in steady state will
	// be less than this number when ((.spec.completions - .status.successful) < .spec.parallelism),
	Parallelism *int32 `json:"parallelism,omitempty" protobuf:"varint,1,opt,name=parallelism"`

	// Specifies the desired number of successfully finished pods the
	// job should be run with. 
	Completions *int32 `json:"completions,omitempty" protobuf:"varint,2,opt,name=completions"`

	// Specifies the duration in seconds relative to the startTime that the job
	// may be continuously active before the system tries to terminate it; value
	// must be positive integer. If a Job is suspended (at creation or through an
	// update), this timer will effectively be stopped and reset when the Job is
	// resumed again.
	ActiveDeadlineSeconds *int64 `json:"activeDeadlineSeconds,omitempty" protobuf:"varint,3,opt,name=activeDeadlineSeconds"`

	// Specifies the policy of handling failed pods. In particular, it allows to
	// specify the set of actions and conditions which need to be
	// satisfied to take the associated action.
	// If empty, the default behaviour applies - the counter of failed pods,
	// represented by the jobs's .status.failed field, is incremented and it is
	// checked against the backoffLimit. This field cannot be used in combination
	// with restartPolicy=OnFailure.
	PodFailurePolicy *PodFailurePolicy `json:"podFailurePolicy,omitempty" protobuf:"bytes,11,opt,name=podFailurePolicy"`

	// successPolicy specifies the policy when the Job can be declared as succeeded.
	// If empty, the default behavior applies - the Job is declared as succeeded
	// only when the number of succeeded pods equals to the completions.
	// When the field is specified, it must be immutable and works only for the Indexed Jobs.
	// Once the Job meets the SuccessPolicy, the lingering pods are terminated.
	SuccessPolicy *SuccessPolicy `json:"successPolicy,omitempty" protobuf:"bytes,16,opt,name=successPolicy"`

	// Specifies the number of retries before marking this job failed.
	// Defaults to 6, unless backoffLimitPerIndex (only Indexed Job) is specified.
	// When backoffLimitPerIndex is specified, backoffLimit defaults to 2147483647.
	BackoffLimit *int32 `json:"backoffLimit,omitempty" protobuf:"varint,7,opt,name=backoffLimit"`

	BackoffLimitPerIndex *int32 `json:"backoffLimitPerIndex,omitempty" protobuf:"varint,12,opt,name=backoffLimitPerIndex"`

	MaxFailedIndexes *int32 `json:"maxFailedIndexes,omitempty" protobuf:"varint,13,opt,name=maxFailedIndexes"`

	Selector *metav1.LabelSelector `json:"selector,omitempty" protobuf:"bytes,4,opt,name=selector"`

	ManualSelector *bool `json:"manualSelector,omitempty" protobuf:"varint,5,opt,name=manualSelector"`

	// Describes the pod that will be created when executing a job.
	// The only allowed template.spec.restartPolicy values are "Never" or "OnFailure".
	// More info: https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/
	Template corev1.PodTemplateSpec `json:"template" protobuf:"bytes,6,opt,name=template"`

	TTLSecondsAfterFinished *int32 `json:"ttlSecondsAfterFinished,omitempty" protobuf:"varint,8,opt,name=ttlSecondsAfterFinished"`

	CompletionMode *CompletionMode `json:"completionMode,omitempty" protobuf:"bytes,9,opt,name=completionMode,casttype=CompletionMode"`

	Suspend *bool `json:"suspend,omitempty" protobuf:"varint,10,opt,name=suspend"`

	// podReplacementPolicy specifies when to create replacement Pods.
	PodReplacementPolicy *PodReplacementPolicy `json:"podReplacementPolicy,omitempty" protobuf:"bytes,14,opt,name=podReplacementPolicy,casttype=podReplacementPolicy"`

	// ManagedBy field indicates the controller that manages a Job.
	ManagedBy *string `json:"managedBy,omitempty" protobuf:"bytes,15,opt,name=managedBy"`
}

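Parallelism

The parallelism parameter (.spec.parallelism) defines the maximum number of Pods a Job runs concurrently. It can be set to any non-negative integer and defaults to 1 when unspecified. If it is set to 0, the Job is effectively paused until the value is increased.

The actual parallelism (the number of Pods running at any instant) may be higher or lower than the requested parallelism, for several reasons:

- For Jobs with a fixed completion count, the number of Pods running in parallel never exceeds the number of remaining completions. Higher values of .spec.parallelism are effectively ignored.
- For work-queue Jobs, no new Pods are started after any Pod has succeeded (Pods that already exist are allowed to finish).
- The Job controller has not had time to react yet.
- The Job controller failed to create Pods for some reason (insufficient ResourceQuota, missing permissions, and so on), so there are fewer Pods than requested.
- The Job controller throttles the creation of new Pods because earlier Pods of the same Job failed too often.
- A Pod is shutting down gracefully and needs time to stop.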
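
To make the first two cases concrete, here is a minimal Python sketch of how the number of Pods a Job aims to keep active could be derived. The function name `wanted_active` and its parameters are made up for illustration. It is a simplified model of the rules above, not the Job controller's actual code, and it ignores quota, backoff, and graceful termination.

```python
from typing import Optional


def wanted_active(parallelism: int, completions: Optional[int],
                  succeeded: int, active: int) -> int:
    """Return how many Pods the Job would aim to keep running (simplified)."""
    if completions is None:
        # Work-queue Job: after the first success no new Pods are started,
        # so the target is whatever is still running.
        return active if succeeded > 0 else parallelism
    # Fixed completion count: never run more Pods than completions remain.
    remaining = max(completions - succeeded, 0)
    return min(parallelism, remaining)


# 10 completions, 8 already done: only 2 Pods are wanted even with parallelism 3.
print(wanted_active(parallelism=3, completions=10, succeeded=8, active=2))
```

Setting `parallelism=0` in this model yields 0 wanted Pods, which matches the "effectively paused" behavior described above.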

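Completions

Completions defines the total number of successful Pods required for the whole Job to complete. Three main types of task are suitable to run as a Job:

Non-parallel Jobs
- Normally only one Pod is started, unless that Pod fails.
- The Job is complete as soon as its Pod terminates successfully.
Parallel Jobs with a fixed completion count
- Specify a non-zero positive value in .spec.completions.
- The Job represents the overall task and is complete when the number of successful Pods reaches .spec.completions.
- With .spec.completionMode="Indexed", each Pod gets a unique index in the range 0 to .spec.completions - 1.
Parallel Jobs with a work queue
- Do not specify .spec.completions (it is unset by default); rely on .spec.parallelism instead.
- The Pods must coordinate among themselves, or through an external service, to decide what each of them works on. For example, a Pod might fetch a batch of up to N items from the work queue.
- Each Pod must be able to determine independently whether all of its peers are done, and therefore whether the entire Job is done.
- Once any Pod of the Job terminates successfully, no new Pods are created.
- Once at least one Pod has terminated successfully and all Pods have terminated, the Job is completed successfully.
- Once any Pod has exited successfully, no other Pod should still be doing work or writing output; they should all be in the process of exiting.

For a non-parallel Job, you can leave both .spec.completions and .spec.parallelism unset. When both are unset, both default to 1.

For a Job with a fixed completion count, set .spec.completions to the number of completions needed. You can set .spec.parallelism or leave it unset, in which case it defaults to 1.

For a work-queue Job, you must leave .spec.completions unset and set .spec.parallelism to a non-negative integer.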

Completion Mode

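Feature state: Kubernetes v1.24 [stable]

Jobs with a fixed completion count (that is, Jobs with a non-null .spec.completions) can specify a completion mode in .spec.completionMode:

- NonIndexed (default): the Job is complete when the number of successful Pods reaches .spec.completions. In other words, every Pod completion is equivalent to any other. Note that Jobs with a null .spec.completions are implicitly NonIndexed.
- Indexed: each Pod of the Job is assigned a unique completion index in the range 0 to .spec.completions - 1.

In Indexed mode the index is available through four mechanisms:
1. The Pod annotation batch.kubernetes.io/job-completion-index.
2. The Pod label batch.kubernetes.io/job-completion-index (Kubernetes v1.28 and later). This requires the PodIndexLabel feature gate, which is enabled by default.
3. As part of the Pod hostname, following the pattern $(job-name)-$(index). When an Indexed Job is combined with a Service, the Pods of the Job can address each other over DNS using these deterministic hostnames. See the "Job with Pod-to-Pod Communication" documentation for how to configure this.
4. From the containerized task, through the environment variable JOB_COMPLETION_INDEX.

The Job is complete when there is one successfully completed Pod for each index. For how to use this mode, see "Indexed Job for Parallel Processing with Static Work Assignment".

Note:
Although rare, more than one Pod can be started for the same index because of node failures, kubelet restarts, Pod evictions, and similar events. In that case only the first Pod that completes successfully counts toward the completion count and updates the Job status. The Job controller deletes the other Pods that are running or have completed for the same index once it detects them.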
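
As a rough illustration of static work assignment, a worker can read its completion index from the JOB_COMPLETION_INDEX environment variable and pick its share of the input. The helper names `current_index` and `shard_for_index`, the file names, and the round-robin split are all assumptions made for this sketch; any deterministic mapping from index to work would do.

```python
import os
from typing import Mapping, Sequence


def current_index(environ: Mapping[str, str] = os.environ) -> int:
    """Read the completion index that Kubernetes injects into the container."""
    return int(environ["JOB_COMPLETION_INDEX"])


def shard_for_index(items: Sequence, index: int, total: int) -> list:
    """Return the items handled by one index, split round-robin over all indexes."""
    if not 0 <= index < total:
        raise ValueError(f"index {index} outside 0..{total - 1}")
    return list(items[index::total])


if __name__ == "__main__":
    # Outside a Job the variable is unset, so fall back to index 0 for the demo.
    env = {"JOB_COMPLETION_INDEX": os.environ.get("JOB_COMPLETION_INDEX", "0")}
    files = [f"part-{n}.csv" for n in range(10)]  # hypothetical input files
    print(shard_for_index(files, current_index(env), total=3))
```

Here `total` would normally equal .spec.completions, so that every input item belongs to exactly one index.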

ActiveDeadlineSeconds

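This field sets an active deadline: you specify a number of seconds in the Job's .spec.activeDeadlineSeconds field. The deadline applies to the whole lifetime of the Job, no matter how many Pods are created.

Once a Job reaches activeDeadlineSeconds, all of its running Pods are terminated and the Job status becomes type: Failed with reason: DeadlineExceeded.

Note:
A Job's .spec.activeDeadlineSeconds takes precedence over its .spec.backoffLimit. Even if the number of retries has not reached backoffLimit yet, no new Pods are created once the Job hits the time limit set by activeDeadlineSeconds.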

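BackoffLimit

There are situations where you want to fail a Job after a certain number of retries, for example because of a logical error in its configuration. To do so, set .spec.backoffLimit to the number of retries allowed before the Job is considered failed.

Unless a per-index backoff limit (.spec.backoffLimitPerIndex) is explicitly specified for an Indexed Job, .spec.backoffLimit defaults to 6. When .spec.backoffLimitPerIndex is specified, .spec.backoffLimit defaults to 2147483647 (MaxInt32).

Failed Pods associated with the Job are recreated by the Job controller with an exponential backoff delay (10 seconds, 20 seconds, 40 seconds, and so on), capped at 6 minutes.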
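
The delay sequence can be reproduced with a few lines of Python. This is only arithmetic on the numbers quoted above (a 10-second base that doubles and a 6-minute cap); the function name `backoff_delays` and its parameters are illustrative and not part of any Kubernetes API.

```python
def backoff_delays(failures: int, base: int = 10, cap: int = 360) -> list:
    """Delay in seconds before each retry: doubles from `base`, capped at `cap`."""
    return [min(base * 2 ** attempt, cap) for attempt in range(failures)]


print(backoff_delays(7))  # [10, 20, 40, 80, 160, 320, 360]
```

From the seventh failure on, every further retry waits the full 6 minutes in this model.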

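The number of retries is calculated in two ways:

- The number of Pods with .status.phase = "Failed".
- When restartPolicy = "OnFailure" is used, the sum of container restarts across all Pods in the Pending or Running phase.

If either calculation reaches .spec.backoffLimit, the Job is considered failed.

Note:
If your Job uses restartPolicy = "OnFailure", be aware that the Pod running the task is terminated once the Job's backoff limit is reached. This can make it harder to debug the Job's executable. When debugging a Job, consider setting restartPolicy to "Never", or use a logging system so that the output of failed Jobs is not lost by accident.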

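BackoffLimitPerIndex (backoff limit per index)

Feature state: Kubernetes v1.33 [stable] (enabled by default)

When you run an Indexed Job, you can choose to handle retries for Pod failures independently for each index. To do so, set .spec.backoffLimitPerIndex to the maximum number of Pod failures allowed per index.

When the failures of an index exceed its backoffLimitPerIndex, Kubernetes considers the index failed and adds it to the .status.failedIndexes field. Indexes that executed successfully (those with a succeeded Pod) are recorded in the .status.completedIndexes field, whether or not backoffLimitPerIndex is set.

Note that a failing index does not interrupt the execution of other indexes. Once all indexes of a Job with backoffLimitPerIndex have finished, the Job controller marks the whole Job as failed (by setting the Failed condition in its status) if at least one index failed. The Job is marked as failed even if most, or almost all, of the indexes succeeded.

You can additionally limit the maximum number of failed indexes by setting .spec.maxFailedIndexes. When the number of failed indexes exceeds maxFailedIndexes, the Job controller terminates all Pods of the Job that are still running. Once all Pods have terminated, the Job controller marks the whole Job as failed by setting the Failed condition in the Job status.

The following example shows how these parameters are used. The Job controller allows one retry per index, and the whole Job is terminated when the total number of failed indexes exceeds 5: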

# /controllers/job-backoff-limit-per-index-example.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-backoff-limit-per-index-example
spec:
  completions: 10
  parallelism: 3
  completionMode: Indexed  # required for this feature
  backoffLimitPerIndex: 1  # at most 1 failure per index is retried (1 retry allowed)
  maxFailedIndexes: 5      # terminate the whole Job when more than 5 indexes fail
  template:
    spec:
      restartPolicy: Never # required for this feature
      containers:
      - name: example
        image: python
        command:
        - python3
        - -c
        - |
          import os, sys
          print("Hello world")
          if int(os.environ.get("JOB_COMPLETION_INDEX")) % 2 == 0:
            sys.exit(1)
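
To see how the manifest above plays out, the following Python sketch walks through the indexes and applies the rules described in this section. It is a deliberately simplified model: the function `simulate_indexed_job` and its result keys are invented for illustration, indexes are processed one at a time instead of three in parallel, and an index is assumed to either always fail or succeed on the first try.

```python
from typing import Callable, Optional


def simulate_indexed_job(completions: int, backoff_limit_per_index: int,
                         max_failed_indexes: Optional[int],
                         always_fails: Callable[[int], bool]) -> dict:
    """Walk the indexes one by one and report how the Job would end (simplified)."""
    failed, completed, pods_created = [], [], 0
    for index in range(completions):
        if always_fails(index):
            # One initial Pod plus backoff_limit_per_index retries, all failing.
            pods_created += backoff_limit_per_index + 1
            failed.append(index)
            if max_failed_indexes is not None and len(failed) > max_failed_indexes:
                # Too many failed indexes: the remaining Pods are terminated.
                return {"failedIndexes": failed, "completedIndexes": completed,
                        "podsCreated": pods_created, "terminatedEarly": True,
                        "jobFailed": True}
        else:
            pods_created += 1
            completed.append(index)
    return {"failedIndexes": failed, "completedIndexes": completed,
            "podsCreated": pods_created, "terminatedEarly": False,
            "jobFailed": bool(failed)}


# Same settings as the manifest above: even indexes always exit with 1.
result = simulate_indexed_job(10, 1, 5, lambda index: index % 2 == 0)
print(result)
```

Under these assumptions exactly five indexes (0, 2, 4, 6, 8) fail. That does not exceed maxFailedIndexes: 5, so the Job is not cut short, but it still ends as failed because at least one index failed.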

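PodFailurePolicy

Feature state: Kubernetes v1.31 [stable] (enabled by default)

A Pod failure policy, defined in the .spec.podFailurePolicy field, lets your cluster handle Pod failures based on container exit codes and Pod conditions.

In some situations you may want finer control over Pod failures than the backoff policy based on .spec.backoffLimit provides. Typical use cases are:

- Optimizing cost: when a Pod fails with an exit code that indicates a software bug, terminate the whole Job immediately and avoid pointless Pod restarts.
- Guaranteeing completion: ignore Pod failures caused by disruptions (such as preemption, API-initiated eviction, or taint-based eviction) so that they do not count toward the .spec.backoffLimit retries.

You configure the Pod failure policy in the .spec.podFailurePolicy field to meet these needs. The policy can handle failures based on container exit codes and Pod conditions.

The types behind .spec.podFailurePolicy are defined as follows: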

// https://github.com/kubernetes/kubernetes/blob/release-1.34/staging/src/k8s.io/api/batch/v1/types.go

type PodFailurePolicyAction string
const (
	PodFailurePolicyActionFailJob PodFailurePolicyAction = "FailJob"
	PodFailurePolicyActionFailIndex PodFailurePolicyAction = "FailIndex"
	PodFailurePolicyActionIgnore PodFailurePolicyAction = "Ignore"
	PodFailurePolicyActionCount PodFailurePolicyAction = "Count"
)

type PodFailurePolicyOnExitCodesOperator string
const (
	PodFailurePolicyOnExitCodesOpIn    PodFailurePolicyOnExitCodesOperator = "In"
	PodFailurePolicyOnExitCodesOpNotIn PodFailurePolicyOnExitCodesOperator = "NotIn"
)

// PodReplacementPolicy specifies the policy for creating pod replacements.
type PodReplacementPolicy string
const (
	TerminatingOrFailed PodReplacementPolicy = "TerminatingOrFailed"
	Failed PodReplacementPolicy = "Failed"
)

type PodFailurePolicyOnExitCodesRequirement struct {
	ContainerName *string `json:"containerName,omitempty" protobuf:"bytes,1,opt,name=containerName"`
	Operator PodFailurePolicyOnExitCodesOperator `json:"operator" protobuf:"bytes,2,req,name=operator"`
	Values []int32 `json:"values" protobuf:"varint,3,rep,name=values"`
}

// PodFailurePolicyOnPodConditionsPattern describes a pattern for matching
// an actual pod condition type.
type PodFailurePolicyOnPodConditionsPattern struct {
	Type corev1.PodConditionType `json:"type" protobuf:"bytes,1,req,name=type"`
	Status corev1.ConditionStatus `json:"status" protobuf:"bytes,2,req,name=status"`
}

type PodFailurePolicyRule struct {
	// Specifies the action taken on a pod failure when the requirements are satisfied.
	Action PodFailurePolicyAction `json:"action" protobuf:"bytes,1,req,name=action"`

	// Represents the requirement on the container exit codes.
	OnExitCodes *PodFailurePolicyOnExitCodesRequirement `json:"onExitCodes,omitempty" protobuf:"bytes,2,opt,name=onExitCodes"`

	// Represents the requirement on the pod conditions. The requirement is represented
	// as a list of pod condition patterns. The requirement is satisfied if at
	// least one pattern matches an actual pod condition. At most 20 elements are allowed.
	OnPodConditions []PodFailurePolicyOnPodConditionsPattern `json:"onPodConditions,omitempty" protobuf:"bytes,3,opt,name=onPodConditions"`
}

// PodFailurePolicy describes how failed pods influence the backoffLimit.
type PodFailurePolicy struct {
	Rules []PodFailurePolicyRule `json:"rules" protobuf:"bytes,1,opt,name=rules"`
}
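
The structs above can be read as a small rule engine. The Python sketch below shows one plausible way to evaluate them, under the following assumptions that go beyond the listing: rules are checked in order and the first match wins, only non-zero exit codes are considered, and an unmatched failure falls back to the default handling, represented here as "Count". The dictionary layout and the function names `rule_matches` and `decide_action` are invented for this example.

```python
from typing import AbstractSet, Dict, List, Tuple


def rule_matches(rule: dict, exit_codes: Dict[str, int],
                 conditions: AbstractSet[Tuple[str, str]]) -> bool:
    """Check one rule against container exit codes and (type, status) conditions."""
    on_exit = rule.get("onExitCodes")
    if on_exit is not None:
        name = on_exit.get("containerName")
        # Assumption: only failed containers (non-zero exit code) are inspected.
        codes = [code for container, code in exit_codes.items()
                 if (name is None or container == name) and code != 0]
        if on_exit["operator"] == "In":
            return any(code in on_exit["values"] for code in codes)
        return any(code not in on_exit["values"] for code in codes)  # NotIn
    # At least one pattern has to match an actual Pod condition.
    return any((pattern["type"], pattern["status"]) in conditions
               for pattern in rule.get("onPodConditions", []))


def decide_action(rules: List[dict], exit_codes: Dict[str, int],
                  conditions: AbstractSet[Tuple[str, str]] = frozenset()) -> str:
    """Return the action of the first matching rule, or "Count" as the fallback."""
    for rule in rules:
        if rule_matches(rule, exit_codes, conditions):
            return rule["action"]
    return "Count"


rules = [
    {"action": "FailJob",
     "onExitCodes": {"containerName": "main", "operator": "In", "values": [42]}},
    {"action": "Ignore",
     "onPodConditions": [{"type": "DisruptionTarget", "status": "True"}]},
]
print(decide_action(rules, {"main": 42}))
print(decide_action(rules, {"main": 1}, {("DisruptionTarget", "True")}))
```

The two rules mirror the two use cases listed earlier: exit code 42 in the container named main (a made-up "this is a bug" code) fails the Job at once, while a failure caused by a disruption is ignored.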

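SuccessPolicy

When creating an Indexed Job, you can use the .spec.successPolicy field to define, based on the Pods that succeeded, when the whole Job is declared successful.

By default, a Job succeeds when the number of successful Pods equals .spec.completions. In some situations you may want more flexible control over what counts as success, for example:

- When running simulations with different parameters, you may not need every simulation to succeed; a subset is enough for the overall task to be considered successful.
- When following a leader-worker pattern, only the success of the leader decides whether the Job succeeded or failed. Distributed training frameworks such as MPI and PyTorch are examples.

You can meet these needs by configuring a success policy in the .spec.successPolicy field. The policy decides whether the Job succeeded based on the Pods (or indexes) that succeeded. Once the Job meets the success policy, the Job controller terminates all Pods that are still running (the lingering Pods).

The SuccessPolicy struct is defined as follows: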

// https://github.com/kubernetes/kubernetes/blob/release-1.34/staging/src/k8s.io/api/batch/v1/types.go

type SuccessPolicy struct {
	Rules []SuccessPolicyRule `json:"rules" protobuf:"bytes,1,opt,name=rules"`
}

// SuccessPolicyRule describes rule for declaring a Job as succeeded.

type SuccessPolicyRule struct {
	// succeededIndexes specifies the set of indexes
	// which need to be contained in the actual set of the succeeded indexes for the Job.
	SucceededIndexes *string `json:"succeededIndexes,omitempty" protobuf:"bytes,1,opt,name=succeededIndexes"`

	// succeededCount specifies the minimal required size of the actual set of the succeeded indexes
	SucceededCount *int32 `json:"succeededCount,omitempty" protobuf:"varint,2,opt,name=succeededCount"`
}

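A success policy is defined by a list of rules. Each rule takes one of the following forms:

**Only succeededIndexes is specified**:
The Job is marked as succeeded once all indexes listed in succeededIndexes have succeeded.
succeededIndexes must be a list of intervals made up of indexes in the range 0 to .spec.completions - 1.
**Only succeededCount is specified**:
The Job is marked as succeeded once the number of succeeded indexes reaches succeededCount.
**Both succeededIndexes and succeededCount are specified**:
The Job is marked as succeeded once the number of succeeded indexes within the subset given by succeededIndexes reaches succeededCount.

Note:
If .spec.successPolicy.rules contains several rules, the Job controller evaluates them in order. As soon as one rule is satisfied, the Job controller marks the Job as succeeded and ignores the remaining rules.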
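
The three rule forms translate directly into set operations. Below is a minimal Python sketch, assuming rules are plain dictionaries with the same keys as the API fields; `parse_indexes` and `success_policy_met` are hypothetical helper names, and input validation (for example against .spec.completions) is left out.

```python
from typing import List, Set


def parse_indexes(spec: str) -> Set[int]:
    """Expand an interval list such as "0,2-3" into the set {0, 2, 3}."""
    indexes: Set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-")
            indexes.update(range(int(low), int(high) + 1))
        else:
            indexes.add(int(part))
    return indexes


def success_policy_met(rules: List[dict], succeeded: Set[int]) -> bool:
    """Evaluate the rules in order; the first satisfied rule decides."""
    for rule in rules:
        wanted = rule.get("succeededIndexes")
        count = rule.get("succeededCount")
        # With succeededIndexes, only successes inside that subset are counted.
        relevant = succeeded if wanted is None else succeeded & parse_indexes(wanted)
        if count is not None:
            if len(relevant) >= count:
                return True
        elif wanted is not None and parse_indexes(wanted) <= succeeded:
            return True
    return False


rule = {"succeededIndexes": "0,2-3", "succeededCount": 1}
print(success_policy_met([rule], succeeded={2}))  # True
print(success_policy_met([rule], succeeded={1}))  # False
```

With both fields set, a single success among indexes 0, 2, and 3 is enough, whereas a success of index 1 does not count toward the rule.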

Here is an example of a Job with a success policy:

# /controllers/job-success-policy.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-success
spec:
  parallelism: 10
  completions: 10
  completionMode: Indexed  # a success policy requires Indexed mode
  successPolicy:
    rules:
      - succeededIndexes: 0,2-3
        succeededCount: 1
  template:
    spec:
      containers:
      - name: main
        image: python
        command:
          # the whole Job succeeds as soon as any one of the indexes 0, 2, or 3 succeeds
          - python3
          - -c
          - |
            import os, sys
            if os.environ.get("JOB_COMPLETION_INDEX") == "2":
              sys.exit(0)
            else:
              sys.exit(1)
      restartPolicy: Never

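The example above specifies both succeededIndexes (0,2-3, meaning indexes 0, 2, and 3) and succeededCount: 1. As soon as any one of the indexes 0, 2, or 3 succeeds, the Job controller marks the Job as succeeded and terminates the other Pods that are still running.

A Job that meets its success policy gets a SuccessCriteriaMet condition with the reason SuccessPolicy added to its status. After the termination of the lingering Pods has been issued, the Job additionally gets the Complete condition.

Index interval notation:
succeededIndexes uses intervals separated by a hyphen (-). For example, 2-3 means indexes 2 and 3, and 0,2-3 means indexes 0, 2, and 3.

Interaction with other terminating policies

Note:
If you configure both a success policy (successPolicy) and terminating policies such as .spec.backoffLimit or .spec.podFailurePolicy, the Job controller acts as soon as the Job meets either kind of policy.

Specifically, once the Job triggers a terminating policy (for example it reaches backoffLimit or matches a FailJob rule), the Job controller honors that policy and ignores the success policy.

In other words, terminating policies take precedence over the success policy.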

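Job termination and cleanup

When a Job completes, no more Pods are created, but the Pods that already exist are usually not deleted either. Keeping them around lets you view the logs of completed Pods to check for errors, warnings, or other diagnostic output. The Job object itself also remains in the cluster after it completes, so that you can inspect its final status.

How can finished Pods be cleaned up automatically? Kubernetes does not delete the Pods of a finished Job by default, but there are a few common approaches:

Delete the Job manually (recommended for testing and development)

kubectl delete job my-job # all Pods belonging to the Job are deleted as well

Use the TTL mechanism (Kubernetes v1.21+, stable since v1.23)

A Job can be cleaned up automatically, together with its Pods, through the .spec.ttlSecondsAfterFinished field.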

apiVersion: batch/v1
kind: Job
metadata:
  name: my-job
spec:
  ttlSecondsAfterFinished: 300  # delete the Job and its Pods 300 seconds (5 minutes) after the Job finishes
  template:
    spec:
      containers:
      - name: my-container
        image: alpine
        command: ["echo", "hello"]
      restartPolicy: Never

Once the Job reaches a terminal state (Complete or Failed), the timer starts. When it expires, the Job and all of its Pods are deleted automatically.

CronJob

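A CronJob, as the name suggests, starts one-off Jobs on a repeating schedule.

Feature state: Kubernetes v1.21 [stable]

A CronJob performs regularly scheduled actions such as backups or report generation. One CronJob object is like one line of a crontab (cron table) file on a Unix system. It runs a Job periodically on a given schedule, written in Cron format.

The CronJobSpec struct is defined as follows: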

// https://github.com/kubernetes/kubernetes/blob/release-1.34/staging/src/k8s.io/api/batch/v1/types.go

type CronJobSpec struct {
	// The schedule in Cron format, see https://en.wikipedia.org/wiki/Cron.
	Schedule string `json:"schedule" protobuf:"bytes,1,opt,name=schedule"`

	// The time zone name for the given schedule,
	TimeZone *string `json:"timeZone,omitempty" protobuf:"bytes,8,opt,name=timeZone"`

	// Optional deadline in seconds for starting the job if it misses scheduled
	// time for any reason.  Missed jobs executions will be counted as failed ones.
	StartingDeadlineSeconds *int64 `json:"startingDeadlineSeconds,omitempty" protobuf:"varint,2,opt,name=startingDeadlineSeconds"`

	// Specifies how to treat concurrent executions of a Job.
	ConcurrencyPolicy ConcurrencyPolicy `json:"concurrencyPolicy,omitempty" protobuf:"bytes,3,opt,name=concurrencyPolicy,casttype=ConcurrencyPolicy"`

	// This flag tells the controller to suspend subsequent executions, it does
	// not apply to already started executions.  Defaults to false.
	Suspend *bool `json:"suspend,omitempty" protobuf:"varint,4,opt,name=suspend"`

	// Specifies the job that will be created when executing a CronJob.
	JobTemplate JobTemplateSpec `json:"jobTemplate" protobuf:"bytes,5,opt,name=jobTemplate"`

	// The number of successful finished jobs to retain. Value must be non-negative integer.
	SuccessfulJobsHistoryLimit *int32 `json:"successfulJobsHistoryLimit,omitempty" protobuf:"varint,6,opt,name=successfulJobsHistoryLimit"`

	// The number of failed finished jobs to retain. Value must be non-negative integer.
	FailedJobsHistoryLimit *int32 `json:"failedJobsHistoryLimit,omitempty" protobuf:"varint,7,opt,name=failedJobsHistoryLimit"`
}

For example:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "* * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox:1.28
            imagePullPolicy: IfNotPresent
            command:
            - /bin/sh
            - -c
            - date; echo Hello from the Kubernetes cluster
          restartPolicy: OnFailure

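The fields of CronJobSpec are explained below:

StartingDeadlineSeconds

.spec.startingDeadlineSeconds is an optional field. It defines a deadline, in seconds, for starting the Job if it misses its scheduled time.

If the Job fails to start at its scheduled time for any reason and the deadline has passed, the CronJob skips that execution (later occurrences are still scheduled normally).

For example, if you have a backup Job that runs twice a day, you might allow it to start up to 8 hours late. A backup taken any later would be pointless, and it is better to wait for the next scheduled run.

Kubernetes treats Jobs that miss their deadline as failed. If you do not set startingDeadlineSeconds, the Job has no start deadline.

If .spec.startingDeadlineSeconds is set (not null), the CronJob controller measures the time between when the Job was expected to be created and now. If the difference is greater than the limit, the execution is skipped.

For example, if the field is set to 200, the Job may be created up to 200 seconds after its scheduled time.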
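
The deadline check boils down to one comparison. Here is a small Python sketch of it, with the made-up function name `may_still_start`; it models only the rule described above, not the controller's full handling of missed schedules.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def may_still_start(scheduled: datetime, now: datetime,
                    starting_deadline_seconds: Optional[int]) -> bool:
    """Return True if a run that was due at `scheduled` may still be created."""
    if starting_deadline_seconds is None:
        return True  # no deadline configured
    return now - scheduled <= timedelta(seconds=starting_deadline_seconds)


scheduled = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
print(may_still_start(scheduled, scheduled + timedelta(seconds=150), 200))  # True
print(may_still_start(scheduled, scheduled + timedelta(seconds=250), 200))  # False
```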

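ConcurrencyPolicy

.spec.concurrencyPolicy is an optional field that specifies how to treat concurrent executions of Jobs created by this CronJob. Exactly one of the following values may be specified:

- Allow (default): Jobs are allowed to run concurrently.
- Forbid: concurrent runs are not allowed. If it is time for a new Job run and the previous run has not finished yet, the new run is skipped. Note that when the previous run finishes, .spec.startingDeadlineSeconds is still taken into account and may result in a new Job run.
- Replace: if it is time for a new Job run and the previous run has not finished yet, the currently running Job is replaced with the new one.

Note: the concurrency policy only applies to Jobs created by the same CronJob. Jobs created by different CronJobs are always allowed to run concurrently.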
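
The three policies can be summarized as a small decision function. This Python sketch is illustrative only: `on_schedule` and the returned strings ("start", "skip", "replace") are labels chosen for this example, not values used by Kubernetes.

```python
def on_schedule(policy: str, previous_still_running: bool) -> str:
    """Decide what happens when a new run is due for a CronJob."""
    if policy not in ("Allow", "Forbid", "Replace"):
        raise ValueError(f"unknown concurrencyPolicy: {policy}")
    if not previous_still_running or policy == "Allow":
        return "start"    # nothing to conflict with, or overlap is permitted
    if policy == "Forbid":
        return "skip"     # the unfinished previous run wins
    return "replace"      # stop the previous run and start the new one


for policy in ("Allow", "Forbid", "Replace"):
    print(policy, on_schedule(policy, previous_still_running=True))
```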

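Suspend

You can suspend the execution of a CronJob by setting the optional field .spec.suspend to true. The field defaults to false.

This setting does not affect Jobs that have already started.

Once the field is set to true, all subsequent executions are suspended (the schedule is kept, but the CronJob controller does not start Jobs to run the tasks) until you set the field back to false.

Warning: schedule times that pass while .spec.suspend is true count as missed Jobs. When .spec.suspend changes from true to false on a CronJob that has no startingDeadlineSeconds, all missed Jobs are scheduled immediately.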

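Job history limits

The .spec.successfulJobsHistoryLimit and .spec.failedJobsHistoryLimit fields specify how many successfully completed and failed Jobs are kept. Both fields are optional.

**.spec.successfulJobsHistoryLimit**: the number of successfully finished Jobs to keep. The default is 3. Setting it to 0 keeps no successful Jobs.
**.spec.failedJobsHistoryLimit**: the number of failed finished Jobs to keep. The default is 1. Setting it to 0 keeps no failed Jobs.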
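
As a sketch of what these limits mean in practice, the Python function below picks the finished Jobs that fall outside the retained history. It assumes the oldest Jobs are dropped first, ordered by finish time; the tuple layout, the job names, and the function name `jobs_to_delete` are illustrative and not taken from the CronJob controller.

```python
from typing import List, Tuple

FinishedJob = Tuple[str, float, bool]  # (name, finish timestamp, succeeded?)


def jobs_to_delete(finished: List[FinishedJob], successful_limit: int = 3,
                   failed_limit: int = 1) -> List[str]:
    """Return the names of finished Jobs that exceed the history limits."""
    doomed: List[str] = []
    for succeeded, limit in ((True, successful_limit), (False, failed_limit)):
        # Oldest first, so the slice below removes the oldest excess Jobs.
        group = sorted((job for job in finished if job[2] == succeeded),
                       key=lambda job: job[1])
        excess = len(group) - limit
        if excess > 0:
            doomed.extend(job[0] for job in group[:excess])
    return doomed


history = [("hello-1", 1, True), ("hello-2", 2, True), ("hello-3", 3, True),
           ("hello-4", 4, True), ("hello-5", 5, False), ("hello-6", 6, False)]
print(jobs_to_delete(history))  # ['hello-1', 'hello-5']
```

With the default limits of 3 and 1, the oldest successful Job and the older of the two failed Jobs are removed.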

For another way to clean up Jobs automatically, see "Clean up finished Jobs automatically".

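TimeZone

Feature state: Kubernetes v1.27 [stable]

For CronJobs with no time zone specified, the kube-controller-manager interprets schedules relative to its local time zone.

You can specify a time zone for a CronJob by setting .spec.timeZone to the name of a valid time zone. For example, setting .spec.timeZone: "Etc/UTC" tells Kubernetes to interpret the schedule relative to Coordinated Universal Time (UTC).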
