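Abstract: This article introduces Kubernetes Resource QoS, explains how the mechanism works, and walks through a brief analysis of the source code.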
Kubernetes defines a Pod's QoS Class from the request and limit values of the resources declared by the Pod's containers.
For each resource, containers can be divided into 3 QoS Classes: Guaranteed, Burstable, and Best-Effort, in decreasing order of QoS level.
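If, for every resource of every container in the Pod, limit and request are equal and non-zero, the Pod's QoS Class is Guaranteed. Note that if a container specifies only limit and not request, request is taken to be equal to limit.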
Examples:
containers:
- name: foo
  resources:
    limits:
      cpu: 10m
      memory: 1Gi
- name: bar
  resources:
    limits:
      cpu: 100m
      memory: 100Mi

containers:
- name: foo
  resources:
    limits:
      cpu: 10m
      memory: 1Gi
    requests:
      cpu: 10m
      memory: 1Gi
- name: bar
  resources:
    limits:
      cpu: 100m
      memory: 100Mi
    requests:
      cpu: 100m
      memory: 100Mi
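If none of the containers in the Pod specifies a request or limit for any resource, the Pod's QoS Class is Best-Effort.
Examples: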
containers:
- name: foo
  resources:
- name: bar
  resources:
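Any other Pod, that is, one that sets request or limit for some resource but does not meet the Guaranteed criteria, is Burstable. When limit is not specified, its effective value is the Capacity of the corresponding node resource.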
Examples:
Container bar specifies no resources.
containers:
- name: foo
  resources:
    limits:
      cpu: 10m
      memory: 1Gi
    requests:
      cpu: 10m
      memory: 1Gi
- name: bar
Containers foo and bar specify different resources.
containers:
- name: foo
  resources:
    limits:
      memory: 1Gi
- name: bar
  resources:
    limits:
      cpu: 100m
Container foo specifies no limit; container bar specifies neither request nor limit.
containers:
- name: foo
  resources:
    requests:
      cpu: 10m
      memory: 1Gi
- name: bar
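The classification rules above can be sketched in Go as follows. This is a minimal illustration that follows the prose rules rather than the kubelet's implementation: the `Resources` type, the `classify` function, and the plain integer units are hypothetical, and it assumes cpu and memory are the only resources that count toward QoS.

```go
package main

import "fmt"

// Byte-size helpers used only by this example.
const (
	Mi int64 = 1 << 20
	Gi int64 = 1 << 30
)

// Resources is a hypothetical, simplified stand-in for a container's
// resources section: cpu in millicores, memory in bytes.
type Resources struct {
	Requests map[string]int64
	Limits   map[string]int64
}

// qosResources lists the resources considered for QoS (assumption).
var qosResources = []string{"cpu", "memory"}

// classify applies the rules described in the text to all containers of a pod.
func classify(containers []Resources) string {
	anySet := false
	guaranteed := true
	for _, c := range containers {
		for _, name := range qosResources {
			req, lim := c.Requests[name], c.Limits[name]
			if req > 0 || lim > 0 {
				anySet = true
			}
			// A request that is left unset is taken to equal the limit.
			if req == 0 {
				req = lim
			}
			// Guaranteed needs a non-zero limit equal to the request
			// for every resource of every container.
			if lim == 0 || req != lim {
				guaranteed = false
			}
		}
	}
	switch {
	case !anySet:
		return "BestEffort"
	case guaranteed:
		return "Guaranteed"
	default:
		return "Burstable"
	}
}

func main() {
	// Limits only on every container: requests default to limits.
	limitsOnly := []Resources{
		{Limits: map[string]int64{"cpu": 10, "memory": 1 * Gi}},
		{Limits: map[string]int64{"cpu": 100, "memory": 100 * Mi}},
	}
	// No requests or limits anywhere.
	nothingSet := []Resources{{}, {}}
	// foo sets only requests; bar sets nothing.
	requestsOnly := []Resources{
		{Requests: map[string]int64{"cpu": 10, "memory": 1 * Gi}},
		{},
	}
	fmt.Println(classify(limitsOnly))   // Guaranteed
	fmt.Println(classify(nothingSet))   // BestEffort
	fmt.Println(classify(requestsOnly)) // Burstable
}
```

The three pods in `main` correspond to the first Guaranteed example, the Best-Effort example, and the last Burstable example from the text.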
When scheduling, kube-scheduler selects a node based on the Pod's request values. A Pod and its containers are not allowed to exceed the effective value specified by limit, if one is set.
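As a rough illustration of request-based placement, the sketch below checks whether a pod's total requests still fit on a node. The `Node` type, its fields, and the `fits` function are hypothetical simplifications, and comparing against the node's allocatable amounts is an assumption about how the check is made; the point is only that limits play no part in the decision.

```go
package main

import "fmt"

// Gi is a byte-size helper used only by this example.
const Gi int64 = 1 << 30

// Node is a hypothetical, simplified view of a node.
type Node struct {
	Name         string
	AllocCPU     int64 // millicores the node can hand out
	AllocMem     int64 // bytes the node can hand out
	RequestedCPU int64 // sum of CPU requests of pods already on the node
	RequestedMem int64 // sum of memory requests of pods already on the node
}

// fits reports whether a pod with the given total requests can be placed on n.
// Only requests are considered; limits are ignored.
func fits(n Node, podCPU, podMem int64) bool {
	return n.RequestedCPU+podCPU <= n.AllocCPU &&
		n.RequestedMem+podMem <= n.AllocMem
}

func main() {
	nodes := []Node{
		{Name: "node-a", AllocCPU: 4000, AllocMem: 8 * Gi, RequestedCPU: 3950, RequestedMem: 2 * Gi},
		{Name: "node-b", AllocCPU: 4000, AllocMem: 8 * Gi, RequestedCPU: 1000, RequestedMem: 6 * Gi},
	}
	// A pod requesting 110m CPU and 1Gi of memory in total.
	for _, n := range nodes {
		fmt.Println(n.Name, fits(n, 110, 1*Gi))
	}
	// node-a false (CPU requests would exceed what the node can hand out)
	// node-b true
}
```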
How the request and limit are enforced depends on whether the resource is compressible or incompressible.
CPU
Pods will not be killed if CPU guarantees cannot be met (for example, if system tasks or daemons take up lots of CPU); they will be temporarily throttled.
Memory
Memory is an incompressible resource, so let's discuss the semantics of memory management a bit.
Best-Effort pods will be treated as lowest priority. Processes in these pods are the first to get killed if the system runs out of memory.
These containers can use any amount of free memory in the node though.
Guaranteed pods are considered top-priority and are guaranteed to not be killed until they exceed their limits, or if the system is under memory pressure and there are no lower priority containers that can be evicted.
Burstable pods have some form of minimal resource guarantee, but can use more resources when available.
Under system memory pressure, these containers are more likely to be killed once they exceed their requests and no Best-Effort pods exist.
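Pod OOM score configuration
Best-Effort: OOM_SCORE_ADJ is set to 1000.
Guaranteed: OOM_SCORE_ADJ is set to -998.
Burstable: OOM_SCORE_ADJ is set to 1000 - 10 * (% of memory requested). If the memory request is 0, OOM_SCORE_ADJ is set to 999. If a process in a Burstable container uses more memory than the container requested, its OOM_SCORE will be 1000; if not, its OOM_SCORE will be < 1000.
Pod infra containers or Special Pod init process: OOM_SCORE_ADJ is -998.
Kubelet, Docker: OOM_SCORE_ADJ is -999 (won't be OOM killed).
The QoS source code lives in pkg/kubelet/qos. The code is very simple and consists mainly of two files: pkg/kubelet/qos/policy.go and pkg/kubelet/qos/qos.go.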
The OOM_SCORE_ADJ values for the QoS Classes discussed above are defined in:
pkg/kubelet/qos/policy.go:21
const (
	PodInfraOOMAdj        int = -998
	KubeletOOMScoreAdj    int = -999
	DockerOOMScoreAdj     int = -999
	KubeProxyOOMScoreAdj  int = -999
	guaranteedOOMScoreAdj int = -998
	besteffortOOMScoreAdj int = 1000
)
The calculation of a container's OOM_SCORE_ADJ is defined in:
pkg/kubelet/qos/policy.go:40
func GetContainerOOMScoreAdjust(pod *v1.Pod, container *v1.Container, memoryCapacity int64) int {
	switch GetPodQOS(pod) {
	case Guaranteed:
		// Guaranteed containers should be the last to get killed.
		return guaranteedOOMScoreAdj
	case BestEffort:
		return besteffortOOMScoreAdj
	}

	// Burstable containers are a middle tier, between Guaranteed and Best-Effort. Ideally,
	// we want to protect Burstable containers that consume less memory than requested.
	// The formula below is a heuristic. A container requesting for 10% of a system's
	// memory will have an OOM score adjust of 900. If a process in container Y
	// uses over 10% of memory, its OOM score will be 1000. The idea is that containers
	// which use more than their request will have an OOM score of 1000 and will be prime
	// targets for OOM kills.
	// Note that this is a heuristic, it won't work if a container has many small processes.
	memoryRequest := container.Resources.Requests.Memory().Value()
	oomScoreAdjust := 1000 - (1000*memoryRequest)/memoryCapacity
	// A guaranteed pod using 100% of memory can have an OOM score of 10. Ensure
	// that burstable pods have a higher OOM score adjustment.
	if int(oomScoreAdjust) < (1000 + guaranteedOOMScoreAdj) {
		return (1000 + guaranteedOOMScoreAdj)
	}
	// Give burstable pods a higher chance of survival over besteffort pods.
	if int(oomScoreAdjust) == besteffortOOMScoreAdj {
		return int(oomScoreAdjust - 1)
	}
	return int(oomScoreAdjust)
}
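To make the heuristic concrete, the sketch below re-implements only the Burstable branch with plain integers and evaluates it for a hypothetical node with 8Gi of memory. The function name `burstableOOMScoreAdj` and the node size are assumptions chosen for illustration; the two constants are copied from the policy.go listing.

```go
package main

import "fmt"

// Gi is a byte-size helper used only by this example.
const Gi int64 = 1 << 30

// Values copied from the constants in pkg/kubelet/qos/policy.go.
const (
	guaranteedOOMScoreAdj = -998
	besteffortOOMScoreAdj = 1000
)

// burstableOOMScoreAdj mirrors the Burstable branch of
// GetContainerOOMScoreAdjust, taking plain byte counts instead of
// pod and container objects.
func burstableOOMScoreAdj(memoryRequest, memoryCapacity int64) int {
	adj := 1000 - (1000*memoryRequest)/memoryCapacity
	// Keep Burstable containers above the Guaranteed tier.
	if int(adj) < 1000+guaranteedOOMScoreAdj {
		return 1000 + guaranteedOOMScoreAdj
	}
	// Keep Burstable containers below the Best-Effort tier.
	if int(adj) == besteffortOOMScoreAdj {
		return int(adj - 1)
	}
	return int(adj)
}

func main() {
	capacity := 8 * Gi
	fmt.Println(burstableOOMScoreAdj(1*Gi, capacity)) // 875: 12.5% requested
	fmt.Println(burstableOOMScoreAdj(4*Gi, capacity)) // 500: 50% requested
	fmt.Println(burstableOOMScoreAdj(0, capacity))    // 999: no memory request
	fmt.Println(burstableOOMScoreAdj(8*Gi, capacity)) // 2: clamped to 1000 + guaranteedOOMScoreAdj
}
```

The larger the share of node memory a Burstable container requests, the lower its adjustment and the later it is picked by the OOM killer, while the two clamps keep the result strictly between the Guaranteed and Best-Effort values.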
The function that determines a Pod's QoS Class is:
pkg/kubelet/qos/qos.go:50
// GetPodQOS returns the QoS class of a pod.
// A pod is besteffort if none of its containers have specified any requests or limits.
// A pod is guaranteed only when requests and limits are specified for all the containers and they are equal.
// A pod is burstable if limits and requests do not match across all containers.
func GetPodQOS(pod *v1.Pod) QOSClass {
	requests := v1.ResourceList{}
	limits := v1.ResourceList{}
	zeroQuantity := resource.MustParse("0")
	isGuaranteed := true
	for _, container := range pod.Spec.Containers {
		// process requests
		for name, quantity := range container.Resources.Requests {
			if !supportedQoSComputeResources.Has(string(name)) {
				continue
			}
			if quantity.Cmp(zeroQuantity) == 1 {
				delta := quantity.Copy()
				if _, exists := requests[name]; !exists {
					requests[name] = *delta
				} else {
					delta.Add(requests[name])
					requests[name] = *delta
				}
			}
		}
		// process limits
		qosLimitsFound := sets.NewString()
		for name, quantity := range container.Resources.Limits {
			if !supportedQoSComputeResources.Has(string(name)) {
				continue
			}
			if quantity.Cmp(zeroQuantity) == 1 {
				qosLimitsFound.Insert(string(name))
				delta := quantity.Copy()
				if _, exists := limits[name]; !exists {
					limits[name] = *delta
				} else {
					delta.Add(limits[name])
					limits[name] = *delta
				}
			}
		}

		if len(qosLimitsFound) != len(supportedQoSComputeResources) {
			isGuaranteed = false
		}
	}
	if len(requests) == 0 && len(limits) == 0 {
		return BestEffort
	}
	// Check is requests match limits for all resources.
	if isGuaranteed {
		for name, req := range requests {
			if lim, exists := limits[name]; !exists || lim.Cmp(req) != 0 {
				isGuaranteed = false
				break
			}
		}
	}
	if isGuaranteed && len(requests) == len(limits) {
		return Guaranteed
	}
	return Burstable
}
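Pod QoS is consulted by the eviction_manager and in the Predicates phase of the scheduler; in other words, Kubernetes uses it when handling overcommitment and during the predicate (pre-selection) phase of scheduling.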