a story

[8] EKS Upgrade

한명 2025. 4. 2. 02:49

In this post, we walk through an EKS upgrade hands-on.

The lab is based on the EKS workshop Amazon EKS Upgrades: Strategies and Best Practices.

The workshop is available at the link below.

https://catalog.us-east-1.prod.workshops.aws/workshops/693bdee4-bc31-41d5-841f-54e3e54f8f4a/en-US

1. EKS Upgrades and Strategies
2. Lab Environment Overview
3. In-place Cluster Upgrade
3.1. Control Plane Upgrade
3.2. Add-on Upgrade
3.3. Managed Node Group Upgrade
3.4. Karpenter Node Upgrade
3.5. Self-managed Node Upgrade
3.6. Fargate Node Upgrade
4. Blue/Green Cluster Upgrade

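1. EKS Upgrades and Strategies

Kubernetes follows semantic versioning: in a version x.y.z, the three parts are the major, minor, and patch versions.

A new Kubernetes minor version is released roughly every four months. Each minor version receives about 12 months of standard support, and three minor versions are under standard support at any given time. Standard support here means that patches are still released for that version.

Amazon EKS follows the Kubernetes release cycle but guarantees a somewhat wider support window. Once a version is released on EKS, it receives 14 months of standard support, and EKS keeps four minor versions under standard support.

For the currently supported Kubernetes versions, see the Amazon EKS Kubernetes release calendar below.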

https://docs.aws.amazon.com/ko_kr/eks/latest/userguide/kubernetes-versions.html#kubernetes-release-calendar

After standard support ends, EKS also offers 12 months of extended support, but this incurs additional cost.

In the web console, you can choose the upgrade policy of an EKS cluster under Overview > Kubernetes version settings > Manage.

Note that if you choose standard support, the cluster is upgraded automatically when the standard support period ends.
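The upgrade policy can also be read and changed from the AWS CLI instead of the console. The sketch below wraps the two calls in shell functions. The function names are made up for illustration, the cluster name is whatever you pass in, and a reasonably recent AWS CLI is assumed.

```shell
# Sketch only: nothing runs until one of the functions is called.

# Print the current upgrade policy (STANDARD or EXTENDED) of cluster $1.
get_upgrade_policy() {
  aws eks describe-cluster --name "$1" \
    --query 'cluster.upgradePolicy.supportType' --output text
}

# Set the upgrade policy of cluster $1 to $2 (STANDARD or EXTENDED).
set_upgrade_policy() {
  aws eks update-cluster-config --name "$1" \
    --upgrade-policy "supportType=$2"
}
```

For example, `get_upgrade_policy eksworkshop-eksctl` would show which policy the lab cluster uses.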

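When planning upgrades, it is tempting to think that because a version has 14 months of standard support, you can simply upgrade once every 14 months. In practice, if you move up only one version, that next version's end of standard support (EOS) arrives soon after, so you need to go up several versions to get roughly another year of stable use.

For example, 1.27 is under standard support from 2023/05/24 to 2024/07/24, but even if you upgrade to 1.28 in July 2024, EOS arrives again on 2024/11/26. In practice you need to upgrade 1.27 -> 1.28 -> 1.29 -> 1.30 to run for about a year without another EOS deadline.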
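The 14-month window in the example can be checked with plain month arithmetic. The helper below is a hypothetical sketch, not an AWS tool. It assumes the day of the month carries over unchanged, which holds for the 1.27 dates quoted above but not for month-end dates such as the 31st.

```shell
# add_months YYYY-MM-DD N : print the date N months later (day kept as-is).
add_months() (
  y=${1%%-*}          # year
  rest=${1#*-}
  m=${rest%%-*}       # month, possibly with a leading zero
  d=${rest#*-}        # day, kept as text
  m=${m#0}            # drop the leading zero so it is not read as octal
  total=$(( y * 12 + (m - 1) + $2 ))
  printf '%04d-%02d-%s\n' $(( total / 12 )) $(( total % 12 + 1 )) "$d"
)

# add_months 2023-05-24 14   prints 2024-07-24 (the 1.27 dates from the text)
```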

EKS Upgrade Process

This section leaves out pre-upgrade review and backups and covers only the work of upgrading the cluster itself.

The overall upgrade procedure is as follows.

1) Upgrade the control plane

2) Upgrade the add-ons

3) Upgrade the data plane

If the cluster has several kinds of data plane, the upgrade method differs for each kind.
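The order above can be sketched as AWS CLI calls. This post drives every step through Terraform, so the helper below is only an illustration: it prints the commands instead of running them, the function name and the node group argument are hypothetical, and the add-on version is left as a placeholder because it has to be looked up for each target version.

```shell
# print_upgrade_plan CLUSTER VERSION NODEGROUP
# Prints one AWS CLI command per step, in the order they should run.
print_upgrade_plan() (
  cluster=$1; version=$2; nodegroup=$3
  # 1) control plane
  echo "aws eks update-cluster-version --name $cluster --kubernetes-version $version"
  # 2) add-ons (only the two add-ons upgraded later in this post)
  for addon in coredns kube-proxy; do
    echo "aws eks update-addon --cluster-name $cluster --addon-name $addon --addon-version <version-compatible-with-$version>"
  done
  # 3) data plane (a managed node group in this sketch)
  echo "aws eks update-nodegroup-version --cluster-name $cluster --nodegroup-name $nodegroup --kubernetes-version $version"
)
```

Calling `print_upgrade_plan eksworkshop-eksctl 1.26 my-nodegroup` lists the control plane command first, then the add-ons, then the node group.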

EKS Upgrade Strategies

There are two EKS upgrade strategies: in-place and blue/green.

An in-place upgrade raises the version of the cluster that is currently in operation. A blue/green upgrade creates a new cluster (green), deploys the workloads to it, and then switches over to the new cluster.

For more details on upgrades, see the documents below.

Best Practices for Cluster Upgrades

https://docs.aws.amazon.com/eks/latest/best-practices/cluster-upgrades.html

Kubernetes cluster upgrade: the blue-green deployment strategy

https://aws.amazon.com/ko/blogs/containers/kubernetes-cluster-upgrade-the-blue-green-deployment-strategy/

Another important point is the Kubernetes version skew policy.

https://kubernetes.io/releases/version-skew-policy/#supported-version-skew

The Kubernetes version skew policy defines how far apart the versions of the core components (for example, kube-apiserver and kubelet) are allowed to be. An in-place upgrade can step through several versions in sequence, but you must plan it so that the control plane and data plane stay within the allowed skew.

For example, when kube-apiserver is at 1.32, kubelet and kube-proxy may be as old as 1.29. So, starting from 1.29, you can upgrade the control plane up to 1.32 first and then upgrade the node groups in sequence.
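The skew rule in the example comes down to comparing minor versions. The function below is a minimal sketch with a made-up name. The default gap of 3 is taken from the 1.32 / 1.29 example above and can be overridden, since older releases allow a smaller gap.

```shell
# skew_ok APISERVER_VERSION KUBELET_VERSION [MAX_GAP]
# Succeeds when the kubelet is not newer than the API server and is at most
# MAX_GAP minor versions behind it (assumed default: 3).
skew_ok() (
  api_minor=${1#*.};  api_minor=${api_minor%%.*}     # 1.32 -> 32, 1.26.15 -> 26
  node_minor=${2#*.}; node_minor=${node_minor%%.*}
  max=${3:-3}
  gap=$(( api_minor - node_minor ))
  [ "$gap" -ge 0 ] && [ "$gap" -le "$max" ]
)
```

With these rules `skew_ok 1.32 1.29` succeeds and `skew_ok 1.32 1.28` fails, so nodes on 1.28 would have to be upgraded before the control plane reaches 1.32.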

Also note that an in-place Kubernetes upgrade only supports moving up one minor version at a time; you cannot jump several versions in a single upgrade.
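Because each upgrade moves exactly one minor version, a multi-version upgrade is a chain of single steps. The hypothetical helper below just spells out that chain for a given start and target version. It assumes both versions share the same major version.

```shell
# upgrade_path FROM TO : print every minor version that must be passed through.
upgrade_path() (
  major=${1%%.*}
  from=${1#*.}
  to=${2#*.}
  v=$from
  out=
  while [ "$v" -le "$to" ]; do
    out="${out:+$out -> }$major.$v"   # join the steps with " -> "
    v=$(( v + 1 ))
  done
  printf '%s\n' "$out"
)

# upgrade_path 1.27 1.30   prints 1.27 -> 1.28 -> 1.29 -> 1.30
```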

As part of the pre-upgrade review, check the Upgrade insights under Cluster insights.

Upgrade insights check Kubernetes version skew, cluster health, add-on version compatibility, and the use of deprecated APIs.
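The same insights can be listed from the AWS CLI. The function below is a sketch with a made-up name, and it assumes an AWS CLI version recent enough to include the EKS insights commands. The cluster name is passed in as an argument.

```shell
# list_upgrade_insights CLUSTER
# Lists the name and status of each upgrade-readiness insight for the cluster.
list_upgrade_insights() {
  aws eks list-insights \
    --cluster-name "$1" \
    --filter 'categories=UPGRADE_READINESS' \
    --query 'insights[].{name:name,status:insightStatus.status}' \
    --output table
}
```

Any insight whose status is not passing is worth opening in the console before starting the upgrade.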

We look at the details in the hands-on sections that follow.

2. Lab Environment Overview

The lab cluster runs EKS 1.25, and its upgrade policy is Extended.

The Compute information shows several kinds of data plane. Under Nodes there are two managed nodes and two self-managed nodes (one of which is actually a Karpenter node). Fargate nodes are present as well.

So, after upgrading the control plane, we work through the different upgrade method for each type of node group.

Besides the web console, the workshop provides code-server. It holds the Terraform files and a local clone of the GitOps repository (eks-gitops-repo), and its right-hand pane offers a terminal and a file editor.

The GitOps repository has CodeCommit as its remote, and Argo CD is configured to watch that CodeCommit repository.

On EKS, Argo CD runs the following applications and Pods using the app-of-apps pattern.

ec2-user:~/environment/terraform:$ kubectl get application -A
NAMESPACE   NAME        SYNC STATUS   HEALTH STATUS
argocd      apps        Synced        Healthy
argocd      assets      Synced        Healthy
argocd      carts       Synced        Healthy
argocd      catalog     Synced        Healthy
argocd      checkout    Synced        Healthy
argocd      karpenter   Synced        Healthy
argocd      orders      Synced        Healthy
argocd      other       Synced        Healthy
argocd      rabbitmq    Synced        Healthy
argocd      ui          OutOfSync     Healthy

ec2-user:~/environment/terraform:$ kubectl get po -A |grep -v kube-system
NAMESPACE     NAME                                                        READY   STATUS    RESTARTS       AGE
argocd        argo-cd-argocd-application-controller-0                     1/1     Running   0              2d8h
argocd        argo-cd-argocd-applicationset-controller-74d9c9c5c7-n5k95   1/1     Running   0              2d8h
argocd        argo-cd-argocd-dex-server-6dbbd57479-mst55                  1/1     Running   0              2d8h
argocd        argo-cd-argocd-notifications-controller-fb4b954d5-v9dw7     1/1     Running   0              2d8h
argocd        argo-cd-argocd-redis-76b4c599dc-c8d2j                       1/1     Running   0              2d8h
argocd        argo-cd-argocd-repo-server-6b777b579d-b7ssz                 1/1     Running   0              2d8h
argocd        argo-cd-argocd-server-86bdbd7b89-gzm7d                      1/1     Running   0              2d8h
assets        assets-7ccc84cb4d-2p284                                     1/1     Running   0              2d8h
carts         carts-7ddbc698d8-wl9k9                                      1/1     Running   1 (2d8h ago)   2d8h
carts         carts-dynamodb-6594f86bb9-8gwpf                             1/1     Running   0              2d8h
catalog       catalog-857f89d57d-nrnf7                                    1/1     Running   3 (2d8h ago)   2d8h
catalog       catalog-mysql-0                                             1/1     Running   0              2d8h
checkout      checkout-558f7777c-z5qvh                                    1/1     Running   0              17h
checkout      checkout-redis-f54bf7cb5-r2sdp                              1/1     Running   0              17h
karpenter     karpenter-74c6ffc5d9-8m6mc                                  1/1     Running   0              2d8h
karpenter     karpenter-74c6ffc5d9-nj7lc                                  1/1     Running   0              2d8h
orders        orders-5b97745747-7rwdl                                     1/1     Running   2 (2d8h ago)   2d8h
orders        orders-mysql-b9b997d9d-bnbmn                                1/1     Running   0              2d8h
rabbitmq      rabbitmq-0                                                  1/1     Running   0              2d8h
ui            ui-5dfb7d65fc-nfrjw                                         1/1     Running   0              2d8h

To get familiar with the environment, we push a change to CodeCommit and let Argo CD sync it.

# Expose the ui service through an NLB
cat << EOF > ~/environment/eks-gitops-repo/apps/ui/service-nlb.yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
    service.beta.kubernetes.io/aws-load-balancer-type: external
  labels:
    app.kubernetes.io/instance: ui
    app.kubernetes.io/name: ui
  name: ui-nlb
  namespace: ui
spec:
  type: LoadBalancer
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: 8080
  selector:
    app.kubernetes.io/instance: ui
    app.kubernetes.io/name: ui
EOF

cat << EOF > ~/environment/eks-gitops-repo/apps/ui/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: ui
resources:
  - namespace.yaml
  - configMap.yaml
  - serviceAccount.yaml
  - service.yaml
  - deployment.yaml
  - hpa.yaml
  - service-nlb.yaml
EOF

# Commit and push the change, then sync the ui application with Argo CD
cd ~/environment/eks-gitops-repo/
git add apps/ui/service-nlb.yaml apps/ui/kustomization.yaml
git commit -m "Add to ui nlb"
git push
argocd app sync ui
...

# Check the UI URL (1.5x, 1.3x zoom)
kubectl get svc -n ui ui-nlb -o jsonpath='{.status.loadBalancer.ingress[0].hostname}' | awk '{ print "UI URL = http://"$1""}'

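Now let's perform the actual upgrade. A cluster can be upgraded through the web console, the CLI, or IaC tools. In this lab, every upgrade is done with Terraform.

3. In-place Cluster Upgrade

3.1. Control Plane Upgrade

The EKS control plane upgrade is known to be carried out in a blue/green fashion. If a problem occurs during the upgrade, it is rolled back to minimize the impact. When a rollback happens, EKS evaluates the cause of the failure and provides guidance for resolving it, so you can fix the problem and try the upgrade again.

Before upgrading the control plane, we start monitoring calls to the service.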

export UI_WEB=$(kubectl get svc -n ui ui-nlb -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'/actuator/health/liveness)

while true; do date; curl -s $UI_WEB; echo; aws eks describe-cluster --name eksworkshop-eksctl | egrep 'version|endpoint"|issuer|platformVersion'; echo ; sleep 2; echo; done

In the Terraform code, the EKS version is stored in variables.tf. Change cluster_version there from 1.25 to 1.26.

variable "cluster_version" {
  description = "EKS cluster version."
  type        = string
  default     = "1.25"
}

variable "mng_cluster_version" {
  description = "EKS cluster mng version."
  type        = string
  default     = "1.25"
}


variable "ami_id" {
  description = "EKS AMI ID for node groups"
  type        = string
  default     = ""
}

Run the following in the terminal; it completes in roughly 10 minutes.

ec2-user:~/environment/terraform:$ terraform apply -auto-approve
aws_iam_user.argocd_user: Refreshing state... [id=argocd-user]
module.vpc.aws_vpc.this[0]: Refreshing state... [id=vpc-0cf5ec98d2e448575]
module.eks.data.aws_partition.current[0]: Reading...
data.aws_caller_identity.current: Reading...
...
Plan: 6 to add, 13 to change, 6 to destroy.
...
...
Apply complete! Resources: 6 added, 2 changed, 6 destroyed.

Outputs:

configure_kubectl = "aws eks --region us-west-2 update-kubeconfig --name eksworkshop-eksctl"

The web console also shows that the upgrade has been triggered.

The timestamps and state before and after the control plane upgrade are as follows.

# Before the upgrade
Tue Apr  1 14:22:41 UTC 2025
{"status":"UP"}
        "version": "1.25",
        "endpoint": "https://C55B5928163C30776DEF011A92FE870C.gr7.us-west-2.eks.amazonaws.com",
                "issuer": "https://oidc.eks.us-west-2.amazonaws.com/id/C55B5928163C30776DEF011A92FE870C"
        "platformVersion": "eks.44",
...
Tue Apr  1 14:30:20 UTC 2025
{"status":"UP"}
        "version": "1.26",
        "endpoint": "https://C55B5928163C30776DEF011A92FE870C.gr7.us-west-2.eks.amazonaws.com",
                "issuer": "https://oidc.eks.us-west-2.amazonaws.com/id/C55B5928163C30776DEF011A92FE870C"
        "platformVersion": "eks.45",

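After the upgrade, the control plane version differs from the data plane version, but this is still within the Kubernetes version skew policy. You can confirm this in Upgrade insights.

3.2. Add-on Upgrade

Next, we upgrade the add-ons.

eksctl shows which versions each add-on can be upgraded to (for example, with eksctl get addon --cluster <cluster-name>).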

NAME                    VERSION                 STATUS  ISSUES  IAMROLE                                                                                         UPDATE AVAILABLE                                                                                                                                                   CONFIGURATION VALUES
aws-ebs-csi-driver      v1.41.0-eksbuild.1      ACTIVE  0       arn:aws:iam::181150650881:role/eksworkshop-eksctl-ebs-csi-driver-2025033005221599450000001d
coredns                 v1.8.7-eksbuild.10      ACTIVE  0                                                                                                       v1.9.3-eksbuild.22,v1.9.3-eksbuild.21,v1.9.3-eksbuild.19,v1.9.3-eksbuild.17,v1.9.3-eksbuild.15,v1.9.3-eksbuild.11,v1.9.3-eksbuild.10,v1.9.3-eksbuild.9,v1.9.3-eksbuild.7,v1.9.3-eksbuild.6,v1.9.3-eksbuild.5,v1.9.3-eksbuild.3,v1.9.3-eksbuild.2
kube-proxy              v1.25.16-eksbuild.8     ACTIVE  0                                                                                                       v1.26.15-eksbuild.24,v1.26.15-eksbuild.19,v1.26.15-eksbuild.18,v1.26.15-eksbuild.14,v1.26.15-eksbuild.10,v1.26.15-eksbuild.5,v1.26.15-eksbuild.2,v1.26.13-eksbuild.2,v1.26.11-eksbuild.4,v1.26.11-eksbuild.1,v1.26.9-eksbuild.2,v1.26.7-eksbuild.2,v1.26.6-eksbuild.2,v1.26.6-eksbuild.1,v1.26.4-eksbuild.1,v1.26.2-eksbuild.1
vpc-cni                 v1.19.3-eksbuild.1      ACTIVE  0

For each add-on, the pages below list the versions compatible with each EKS version.

Amazon VPC CNI: https://docs.aws.amazon.com/eks/latest/userguide/managing-vpc-cni.html
coredns: https://docs.aws.amazon.com/eks/latest/userguide/managing-coredns.html
kube-proxy: https://docs.aws.amazon.com/eks/latest/userguide/managing-kube-proxy.html

The commands below list the versions compatible with 1.26. VPC CNI and the EBS CSI driver are already on their latest versions, so we check only coredns (v1.8.7-eksbuild.10) and kube-proxy (v1.25.16-eksbuild.8).

ec2-user:~/environment/terraform:$ aws eks describe-addon-versions --addon-name coredns --kubernetes-version 1.26 --output table \
    --query "addons[].addonVersions[:10].{Version:addonVersion,DefaultVersion:compatibilities[0].defaultVersion}"
------------------------------------------
|          DescribeAddonVersions         |
+-----------------+----------------------+
| DefaultVersion  |       Version        |
+-----------------+----------------------+
|  False          |  v1.9.3-eksbuild.22  |
|  False          |  v1.9.3-eksbuild.21  |
|  False          |  v1.9.3-eksbuild.19  |
|  False          |  v1.9.3-eksbuild.17  |
|  False          |  v1.9.3-eksbuild.15  |
|  False          |  v1.9.3-eksbuild.11  |
|  False          |  v1.9.3-eksbuild.10  |
|  False          |  v1.9.3-eksbuild.9   |
|  True           |  v1.9.3-eksbuild.7   |
|  False          |  v1.9.3-eksbuild.6   |
+-----------------+----------------------+
ec2-user:~/environment/terraform:$ aws eks describe-addon-versions --addon-name kube-proxy --kubernetes-version 1.26 --output table \
    --query "addons[].addonVersions[:10].{Version:addonVersion,DefaultVersion:compatibilities[0].defaultVersion}"
--------------------------------------------
|           DescribeAddonVersions          |
+-----------------+------------------------+
| DefaultVersion  |        Version         |
+-----------------+------------------------+
|  False          |  v1.26.15-eksbuild.24  |
|  False          |  v1.26.15-eksbuild.19  |
|  False          |  v1.26.15-eksbuild.18  |
|  False          |  v1.26.15-eksbuild.14  |
|  False          |  v1.26.15-eksbuild.10  |
|  False          |  v1.26.15-eksbuild.5   |
|  False          |  v1.26.15-eksbuild.2   |
|  False          |  v1.26.13-eksbuild.2   |
|  False          |  v1.26.11-eksbuild.4   |
|  False          |  v1.26.11-eksbuild.1   |
+-----------------+------------------------+
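When only the default version matters, the query can filter on the DefaultVersion flag directly instead of listing the first ten versions. The function below is a sketch with a made-up name; it reuses the same describe-addon-versions call with a narrower JMESPath query.

```shell
# default_addon_version ADDON K8S_VERSION
# Prints the add-on version that EKS marks as the default for that
# Kubernetes version.
default_addon_version() {
  aws eks describe-addon-versions \
    --addon-name "$1" \
    --kubernetes-version "$2" \
    --query 'addons[].addonVersions[?compatibilities[0].defaultVersion==`true`].addonVersion' \
    --output text
}
```

For coredns on 1.26, this should correspond to the row marked True in the table above. Whether to pin the default or the newest version is a separate decision; this lab moves to the latest versions.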

Open addons.tf in the Terraform code and update the addon_version values below, shown here before the change, to the latest versions.

  eks_addons = {
    coredns = {
      addon_version = "v1.8.7-eksbuild.10"
    }
    kube-proxy = {
      addon_version = "v1.25.16-eksbuild.8"
    }
    vpc-cni = {
      most_recent = true
    }
    aws-ebs-csi-driver = {
      service_account_role_arn = module.ebs_csi_driver_irsa.iam_role_arn
    }
  }

Run the upgrade with terraform.

ec2-user:~/environment/terraform:$ terraform apply -auto-approve
aws_iam_user.argocd_user: Refreshing state... [id=argocd-user]
data.aws_caller_identity.current: Reading...

...

Apply complete! Resources: 0 added, 2 changed, 0 destroyed.

Outputs:

configure_kubectl = "aws eks --region us-west-2 update-kubeconfig --name eksworkshop-eksctl"

It took roughly 1 minute 30 seconds.

Tue Apr  1 14:58:07 UTC 2025
{"status":"UP"}
        "version": "1.26",
        "endpoint": "https://C55B5928163C30776DEF011A92FE870C.gr7.us-west-2.eks.amazonaws.com",
                "issuer": "https://oidc.eks.us-west-2.amazonaws.com/id/C55B5928163C30776DEF011A92FE870C"
        "platformVersion": "eks.45",
...
Tue Apr  1 14:59:36 UTC 2025
{"status":"UP"}
        "version": "1.26",
        "endpoint": "https://C55B5928163C30776DEF011A92FE870C.gr7.us-west-2.eks.amazonaws.com",
                "issuer": "https://oidc.eks.us-west-2.amazonaws.com/id/C55B5928163C30776DEF011A92FE870C"
        "platformVersion": "eks.45",

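The related Pods are replaced by a rolling update. In the watch output below, one old coredns Pod starts terminating right away, while the other is terminated only after a new Pod is Running, so a coredns Pod stays available throughout; coredns also has a PDB, listed below. kube-proxy is a DaemonSet, so on each node the old Pod is terminated and then a new Pod is created.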

ec2-user:~/environment:$ kubectl get pdb -n kube-system
NAME                           MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
aws-load-balancer-controller   N/A             1                 1                     2d9h
coredns                        N/A             1                 1                     2d9h
ebs-csi-controller             N/A             1                 1                     2d9h

ec2-user:~/environment:$ kubectl get po -n kube-system -w 
NAME                                            READY   STATUS    RESTARTS   AGE
...
kube-proxy-rdhmw                                1/1     Terminating   0          2d9h
coredns-98f76fbc4-d7l7z                         1/1     Terminating   0          2d9h
kube-proxy-rdhmw                                0/1     Terminating   0          2d9h
kube-proxy-rdhmw                                0/1     Terminating   0          2d9h
kube-proxy-rdhmw                                0/1     Terminating   0          2d9h
coredns-58cc4d964b-5rbmb                        0/1     Pending       0          0s
coredns-58cc4d964b-5rbmb                        0/1     Pending       0          0s
coredns-58cc4d964b-5rbmb                        0/1     ContainerCreating   0          0s
coredns-58cc4d964b-d5zmg                        0/1     Pending             0          0s
coredns-58cc4d964b-d5zmg                        0/1     Pending             0          0s
kube-proxy-gbn46                                0/1     Pending             0          1s
kube-proxy-gbn46                                0/1     Pending             0          1s
coredns-58cc4d964b-d5zmg                        0/1     ContainerCreating   0          0s
kube-proxy-gbn46                                0/1     ContainerCreating   0          1s
coredns-98f76fbc4-d7l7z                         0/1     Terminating         0          2d9h
coredns-98f76fbc4-d7l7z                         0/1     Terminating         0          2d9h
coredns-98f76fbc4-d7l7z                         0/1     Terminating         0          2d9h
coredns-58cc4d964b-d5zmg                        0/1     Running             0          2s
coredns-58cc4d964b-d5zmg                        1/1     Running             0          2s
coredns-98f76fbc4-brtkn                         1/1     Terminating         0          2d9h
coredns-58cc4d964b-5rbmb                        0/1     Running             0          3s
coredns-58cc4d964b-5rbmb                        1/1     Running             0          3s
kube-proxy-gbn46                                1/1     Running             0          3s
kube-proxy-rkvpc                                1/1     Terminating         0          2d9h
coredns-98f76fbc4-brtkn                         0/1     Terminating         0          2d9h
coredns-98f76fbc4-brtkn                         0/1     Terminating         0          2d9h
coredns-98f76fbc4-brtkn                         0/1     Terminating         0          2d9h
kube-proxy-rkvpc                                0/1     Terminating         0          2d9h
kube-proxy-rkvpc                                0/1     Terminating         0          2d9h
kube-proxy-rkvpc                                0/1     Terminating         0          2d9h
kube-proxy-tt8mk                                0/1     Pending             0          0s
kube-proxy-tt8mk                                0/1     Pending             0          0s
kube-proxy-tt8mk                                0/1     ContainerCreating   0          0s
kube-proxy-tt8mk                                1/1     Running             0          2s
kube-proxy-psbfc                                1/1     Terminating         0          2d9h
kube-proxy-psbfc                                0/1     Terminating         0          2d9h
kube-proxy-psbfc                                0/1     Terminating         0          2d9h
kube-proxy-psbfc                                0/1     Terminating         0          2d9h
kube-proxy-vv6cz                                0/1     Pending             0          0s
kube-proxy-vv6cz                                0/1     Pending             0          0s
kube-proxy-vv6cz                                0/1     ContainerCreating   0          0s
kube-proxy-vv6cz                                1/1     Running             0          2s
kube-proxy-sv977                                1/1     Terminating         0          2d9h
kube-proxy-sv977                                0/1     Terminating         0          2d9h
kube-proxy-sv977                                0/1     Terminating         0          2d9h
kube-proxy-sv977                                0/1     Terminating         0          2d9h
kube-proxy-t9xxk                                0/1     Pending             0          0s
kube-proxy-t9xxk                                0/1     Pending             0          0s
kube-proxy-t9xxk                                0/1     ContainerCreating   0          0s
kube-proxy-t9xxk                                1/1     Running             0          2s
kube-proxy-5zz6t                                1/1     Terminating         0          17h
kube-proxy-5zz6t                                0/1     Terminating         0          17h
kube-proxy-5zz6t                                0/1     Terminating         0          17h
kube-proxy-5zz6t                                0/1     Terminating         0          17h
kube-proxy-zh6st                                0/1     Pending             0          0s
kube-proxy-zh6st                                0/1     Pending             0          0s
kube-proxy-zh6st                                0/1     ContainerCreating   0          0s
kube-proxy-zh6st                                1/1     Running             0          2s
kube-proxy-jbwlb                                1/1     Terminating         0          2d9h
kube-proxy-jbwlb                                0/1     Terminating         0          2d9h
kube-proxy-jbwlb                                0/1     Terminating         0          2d9h
kube-proxy-jbwlb                                0/1     Terminating         0          2d9h
kube-proxy-6jlqj                                0/1     Pending             0          0s
kube-proxy-6jlqj                                0/1     Pending             0          0s
kube-proxy-6jlqj                                0/1     ContainerCreating   0          0s
kube-proxy-6jlqj                                1/1     Running             0          2s

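3.3. Managed Node Group Upgrade

Even within an in-place cluster upgrade, a managed node group itself can be upgraded either in place or blue/green.

In-place upgrade of a managed node group

An in-place upgrade is implemented as a gradual rolling update: new nodes are first added to the ASG, and then the old nodes are cordoned, drained, and removed.

The process can be understood as four phases: setup, scale-up, upgrade, and scale-down.

1) Setup phase

The ASG is updated to use the latest launch template version, and the updateConfig property determines the maximum number of nodes to upgrade in parallel.

The updateConfig values can be found in the node group's properties.

For the update strategy, Default adds new nodes first and then removes the old ones, whereas Minimal removes old nodes first. Minimal can be chosen for node groups where cost is the priority.

2) Scale-up phase

The ASG's maximum size and desired size are increased by the larger of two values: up to twice the number of Availability Zones the ASG is deployed in, or the maximum unavailable setting.

After the node group is scaled up, the old nodes are marked un-schedulable and labeled with node.kubernetes.io/exclude-from-external-load-balancers=true so that they are removed from load balancers before termination.

3) Upgrade phase

The Pods on an old node are drained and the node is cordoned, after which a termination request is sent to the ASG. Nodes are processed in batches of up to the maximum unavailable setting, and the phase repeats until every old node has been removed.

4) Scale-down phase

The ASG's maximum size and desired size are decreased by 1 at a time until they return to their values from before the update started.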
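The updateConfig that drives these phases can be inspected and adjusted from the AWS CLI before starting an upgrade. The functions below are a sketch with made-up names. The cluster and node group names are arguments, and only the maxUnavailable setting is shown.

```shell
# show_update_config CLUSTER NODEGROUP
# Prints the node group's updateConfig (max unavailable, update strategy).
show_update_config() {
  aws eks describe-nodegroup \
    --cluster-name "$1" \
    --nodegroup-name "$2" \
    --query 'nodegroup.updateConfig' \
    --output json
}

# set_max_unavailable CLUSTER NODEGROUP COUNT
# Sets how many nodes may be upgraded in parallel.
set_max_unavailable() {
  aws eks update-nodegroup-config \
    --cluster-name "$1" \
    --nodegroup-name "$2" \
    --update-config "maxUnavailable=$3"
}
```

A larger maxUnavailable shortens the upgrade phase at the cost of draining more nodes at once, so it is usually chosen with the workloads' PDBs in mind.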

Now let's run the in-place upgrade.

In variables.tf, change the managed node group version to 1.26.

variable "cluster_version" {
  description = "EKS cluster version."
  type        = string
  default     = "1.26"
}

variable "mng_cluster_version" {
  description = "EKS cluster mng version."
  type        = string
  default     = "1.25" # <- 1.26 
}


variable "ami_id" {
  description = "EKS AMI ID for node groups"
  type        = string
  default     = ""
}

As before, apply with terraform.

ec2-user:~/environment/terraform:$ terraform apply -auto-approve
...

Apply complete! Resources: 3 added, 1 changed, 3 destroyed.

Outputs:

configure_kubectl = "aws eks --region us-west-2 update-kubeconfig --name eksworkshop-eksctl"
ec2-user:~/environment/terraform:$

To watch the scale-up, scale-down, and node states during the upgrade, monitor the nodes as follows.

while true; do date; kubectl get nodes -o wide --label-columns=eks.amazonaws.com/nodegroup,topology.kubernetes.io/zone |grep initial; sleep 5; echo; done

# Initial state
ec2-user:~/environment:$ while true; do date; kubectl get nodes -o wide --label-columns=eks.amazonaws.com/nodegroup,topology.kubernetes.io/zone |grep initial; sleep 5; echo; done
Tue Apr  1 15:49:52 UTC 2025
ip-10-0-12-239.us-west-2.compute.internal           Ready    <none>   2d10h   v1.25.16-eks-59bf375   10.0.12.239   <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2a
ip-10-0-32-55.us-west-2.compute.internal            Ready    <none>   2d10h   v1.25.16-eks-59bf375   10.0.32.55    <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2c


# Scale-up phase (4 nodes added)
Tue Apr  1 15:53:06 UTC 2025
ip-10-0-12-239.us-west-2.compute.internal           Ready    <none>   2d10h   v1.25.16-eks-59bf375   10.0.12.239   <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2a
ip-10-0-15-190.us-west-2.compute.internal           Ready    <none>   27s     v1.26.15-eks-59bf375   10.0.15.190   <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2a
ip-10-0-28-191.us-west-2.compute.internal           Ready    <none>   2m27s   v1.26.15-eks-59bf375   10.0.28.191   <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2b
ip-10-0-30-83.us-west-2.compute.internal            Ready    <none>   2m26s   v1.26.15-eks-59bf375   10.0.30.83    <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2b
ip-10-0-32-55.us-west-2.compute.internal            Ready    <none>   2d10h   v1.25.16-eks-59bf375   10.0.32.55    <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2c
ip-10-0-46-150.us-west-2.compute.internal           Ready    <none>   94s     v1.26.15-eks-59bf375   10.0.46.150   <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2c


# Upgrade phase
# old node cordon
Tue Apr  1 15:53:12 UTC 2025
ip-10-0-12-239.us-west-2.compute.internal           Ready,SchedulingDisabled   <none>   2d10h   v1.25.16-eks-59bf375   10.0.12.239   <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2a
ip-10-0-15-190.us-west-2.compute.internal           Ready                      <none>   33s     v1.26.15-eks-59bf375   10.0.15.190   <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2a
ip-10-0-28-191.us-west-2.compute.internal           Ready                      <none>   2m33s   v1.26.15-eks-59bf375   10.0.28.191   <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2b
ip-10-0-30-83.us-west-2.compute.internal            Ready                      <none>   2m32s   v1.26.15-eks-59bf375   10.0.30.83    <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2b
ip-10-0-32-55.us-west-2.compute.internal            Ready,SchedulingDisabled   <none>   2d10h   v1.25.16-eks-59bf375   10.0.32.55    <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2c
ip-10-0-46-150.us-west-2.compute.internal           Ready                      <none>   100s    v1.26.15-eks-59bf375   10.0.46.150   <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2c

# x.x.x.239 old node NotReady
Tue Apr  1 15:56:07 UTC 2025
ip-10-0-12-239.us-west-2.compute.internal           NotReady,SchedulingDisabled   <none>   2d10h   v1.25.16-eks-59bf375   10.0.12.239   <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2a
ip-10-0-15-190.us-west-2.compute.internal           Ready                         <none>   3m29s   v1.26.15-eks-59bf375   10.0.15.190   <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2a
ip-10-0-28-191.us-west-2.compute.internal           Ready                         <none>   5m29s   v1.26.15-eks-59bf375   10.0.28.191   <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2b
ip-10-0-30-83.us-west-2.compute.internal            Ready                         <none>   5m28s   v1.26.15-eks-59bf375   10.0.30.83    <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2b
ip-10-0-32-55.us-west-2.compute.internal            Ready,SchedulingDisabled      <none>   2d10h   v1.25.16-eks-59bf375   10.0.32.55    <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2c
ip-10-0-46-150.us-west-2.compute.internal           Ready                         <none>   4m36s   v1.26.15-eks-59bf375   10.0.46.150   <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2c

# x.x.x.239 old node removed
Tue Apr  1 15:56:14 UTC 2025
ip-10-0-15-190.us-west-2.compute.internal           Ready                      <none>   3m35s   v1.26.15-eks-59bf375   10.0.15.190   <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2a
ip-10-0-28-191.us-west-2.compute.internal           Ready                      <none>   5m35s   v1.26.15-eks-59bf375   10.0.28.191   <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2b
ip-10-0-30-83.us-west-2.compute.internal            Ready                      <none>   5m34s   v1.26.15-eks-59bf375   10.0.30.83    <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2b
ip-10-0-32-55.us-west-2.compute.internal            Ready,SchedulingDisabled   <none>   2d10h   v1.25.16-eks-59bf375   10.0.32.55    <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2c
ip-10-0-46-150.us-west-2.compute.internal           Ready                      <none>   4m42s   v1.26.15-eks-59bf375   10.0.46.150   <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2c

# x.x.x.55 old node Not ready
Tue Apr  1 15:59:08 UTC 2025
ip-10-0-15-190.us-west-2.compute.internal           Ready                         <none>   6m29s   v1.26.15-eks-59bf375   10.0.15.190   <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2a
ip-10-0-28-191.us-west-2.compute.internal           Ready                         <none>   8m29s   v1.26.15-eks-59bf375   10.0.28.191   <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2b
ip-10-0-30-83.us-west-2.compute.internal            Ready                         <none>   8m28s   v1.26.15-eks-59bf375   10.0.30.83    <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2b
ip-10-0-32-55.us-west-2.compute.internal            NotReady,SchedulingDisabled   <none>   2d10h   v1.25.16-eks-59bf375   10.0.32.55    <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2c
ip-10-0-46-150.us-west-2.compute.internal           Ready                         <none>   7m36s   v1.26.15-eks-59bf375   10.0.46.150   <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2c

# all old nodes removed and all nodes are 1.26.15
Tue Apr  1 15:59:32 UTC 2025
ip-10-0-15-190.us-west-2.compute.internal           Ready    <none>   6m53s   v1.26.15-eks-59bf375   10.0.15.190   <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2a
ip-10-0-28-191.us-west-2.compute.internal           Ready    <none>   8m53s   v1.26.15-eks-59bf375   10.0.28.191   <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2b
ip-10-0-30-83.us-west-2.compute.internal            Ready    <none>   8m52s   v1.26.15-eks-59bf375   10.0.30.83    <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2b
ip-10-0-46-150.us-west-2.compute.internal           Ready    <none>   8m      v1.26.15-eks-59bf375   10.0.46.150   <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2c


# Scale-in phase
# A new node is cordoned as well
Tue Apr  1 16:00:29 UTC 2025
ip-10-0-15-190.us-west-2.compute.internal           Ready                      <none>   7m50s   v1.26.15-eks-59bf375   10.0.15.190   <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2a
ip-10-0-28-191.us-west-2.compute.internal           Ready                      <none>   9m50s   v1.26.15-eks-59bf375   10.0.28.191   <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2b
ip-10-0-30-83.us-west-2.compute.internal            Ready,SchedulingDisabled   <none>   9m49s   v1.26.15-eks-59bf375   10.0.30.83    <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2b
ip-10-0-46-150.us-west-2.compute.internal           Ready                      <none>   8m57s   v1.26.15-eks-59bf375   10.0.46.150   <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2c

# Nodes: 4 -> 3
Tue Apr  1 16:02:22 UTC 2025
ip-10-0-15-190.us-west-2.compute.internal           Ready    <none>   9m43s   v1.26.15-eks-59bf375   10.0.15.190   <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2a
ip-10-0-28-191.us-west-2.compute.internal           Ready    <none>   11m     v1.26.15-eks-59bf375   10.0.28.191   <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2b
ip-10-0-46-150.us-west-2.compute.internal           Ready    <none>   10m     v1.26.15-eks-59bf375   10.0.46.150   <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2c

# One more node cordoned
Tue Apr  1 16:03:31 UTC 2025
ip-10-0-15-190.us-west-2.compute.internal           Ready,SchedulingDisabled   <none>   10m     v1.26.15-eks-59bf375   10.0.15.190   <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2a
ip-10-0-28-191.us-west-2.compute.internal           Ready                      <none>   12m     v1.26.15-eks-59bf375   10.0.28.191   <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2b
ip-10-0-46-150.us-west-2.compute.internal           Ready                      <none>   11m     v1.26.15-eks-59bf375   10.0.46.150   <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2c

# Final state: 2 nodes, upgraded to 1.26
Tue Apr  1 16:05:24 UTC 2025
ip-10-0-28-191.us-west-2.compute.internal           Ready    <none>   14m     v1.26.15-eks-59bf375   10.0.28.191   <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2b
ip-10-0-46-150.us-west-2.compute.internal           Ready    <none>   13m     v1.26.15-eks-59bf375   10.0.46.150   <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   initial-2025033005225054810000002a    us-west-2c

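A managed node group upgrade in EKS first adds new nodes, then cordons and drains the old nodes to remove them, and finally scales back in to the desired count. The end result is that the old nodes are deleted and only nodes created with the new version remain.

Upgrading a node group of two nodes took roughly 15 minutes (15:49:52 to 16:05:24).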
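
As a quick check of that figure, the elapsed time can be worked out from the two timestamps. This is only a sketch: elapsed_seconds is a hypothetical helper name, and it assumes both timestamps fall on the same day.

```shell
# Elapsed seconds between two HH:MM:SS timestamps on the same day.
# elapsed_seconds is a hypothetical helper, not part of the workshop.
elapsed_seconds() {
  printf '%s %s\n' "$1" "$2" |
    awk -F'[: ]' '{ print ($4*3600 + $5*60 + $6) - ($1*3600 + $2*60 + $3) }'
}

start=15:49:52   # start of the upgrade window quoted above
end=16:05:24     # timestamp of the final node listing
secs=$(elapsed_seconds "$start" "$end")
echo "$((secs / 60))m $((secs % 60))s"   # prints "15m 32s"
```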

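Blue/Green upgrade of managed node groups

Managed node groups can also be upgraded with a Blue/Green approach.

In this lab the blue node group runs a stateful workload that uses a PV, so it is provisioned in a single Availability Zone only.

In this case the upgrade proceeds by first adding a Green managed node group in Terraform and then deleting the Blue managed node group.

First, add the Green node group to base.tf.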

  eks_managed_node_groups = {
    initial = {
      instance_types = ["m5.large", "m6a.large", "m6i.large"]
      min_size     = 2
      max_size     = 10
      desired_size = 2
      update_config = {
        max_unavailable_percentage = 35
      }
    }

    blue-mng={
      instance_types = ["m5.large", "m6a.large", "m6i.large"]
      cluster_version = "1.25"
      min_size     = 1
      max_size     = 2
      desired_size = 1
      update_config = {
        max_unavailable_percentage = 35
      }
      labels = {
        type = "OrdersMNG"
      }
      subnet_ids = [module.vpc.private_subnets[0]] # this MNG runs in private subnet 1 (EBS PV in use)
      taints = [
        {
          key    = "dedicated"
          value  = "OrdersApp"
          effect = "NO_SCHEDULE"
        }
      ]
    }

    green-mng={
      instance_types = ["m5.large", "m6a.large", "m6i.large"]
      subnet_ids = [module.vpc.private_subnets[0]]
      min_size     = 1
      max_size     = 2
      desired_size = 1
      update_config = {
        max_unavailable_percentage = 35
      }
      labels = {
        type = "OrdersMNG"
      }
      taints = [
        {
          key    = "dedicated"
          value  = "OrdersApp"
          effect = "NO_SCHEDULE"
        }
      ]
    }
  }

Then run terraform apply -auto-approve and confirm that a node has been added. You can see that the new node was created in the same Availability Zone.

ec2-user:~/environment:$ while true; do date; kubectl get nodes -o wide --label-columns=eks.amazonaws.com/nodegroup,topology.kubernetes.io/zone |egrep "green|blue"; sleep 5; echo; done
Tue Apr  1 16:24:49 UTC 2025
ip-10-0-3-145.us-west-2.compute.internal            Ready    <none>   2d11h   v1.25.16-eks-59bf375   10.0.3.145    <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   blue-mng-2025033005225055020000002c   us-west-2a
...
Tue Apr  1 16:27:44 UTC 2025
ip-10-0-3-145.us-west-2.compute.internal            Ready    <none>   2d11h   v1.25.16-eks-59bf375   10.0.3.145    <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   blue-mng-2025033005225055020000002c    us-west-2a
ip-10-0-3-227.us-west-2.compute.internal            Ready    <none>   40s     v1.26.15-eks-59bf375   10.0.3.227    <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   green-mng-20250401162553095800000007   us-west-2a
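
Because the EBS PV is bound to one Availability Zone, it can help to check the zone column mechanically rather than by eye. The following is a minimal sketch, assuming the zone is the last column of each row as in the listing above; same_zone is a hypothetical helper name.

```shell
# Reads node rows on stdin (zone in the last column) and succeeds only if
# every non-empty row reports the same zone.
# same_zone is a hypothetical helper, not part of the workshop.
same_zone() {
  zones=$(awk 'NF && !seen[$NF]++ { n++ } END { print n + 0 }')
  [ "$zones" -eq 1 ]
}
```

It could be used like this: kubectl get nodes --no-headers --label-columns=eks.amazonaws.com/nodegroup,topology.kubernetes.io/zone | egrep "green|blue" | same_zone && echo "same AZ"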

Now delete the blue node group from base.tf and run terraform apply -auto-approve.

You can see that the orders pods are currently running on the blue node group.

ec2-user:~/environment:$ kubectl get po -A -owide |grep ip-10-0-3-145.us-west-2.compute.internal
kube-system   aws-node-g9sk9                                              2/2     Running   0               2d11h   10.0.3.145    ip-10-0-3-145.us-west-2.compute.internal            <none>           <none>
kube-system   ebs-csi-node-8jqbj                                          3/3     Running   0               2d11h   10.0.0.162    ip-10-0-3-145.us-west-2.compute.internal            <none>           <none>
kube-system   efs-csi-node-x546f                                          3/3     Running   0               2d11h   10.0.3.145    ip-10-0-3-145.us-west-2.compute.internal            <none>           <none>
kube-system   kube-proxy-6jlqj                                            1/1     Running   0               92m     10.0.3.145    ip-10-0-3-145.us-west-2.compute.internal            <none>           <none>
orders        orders-5b97745747-7rwdl                                     1/1     Running   2 (2d10h ago)   2d11h   10.0.3.163    ip-10-0-3-145.us-west-2.compute.internal            <none>           <none>
orders        orders-mysql-b9b997d9d-bnbmn                                1/1     Running   0               2d11h   10.0.7.229    ip-10-0-3-145.us-west-2.compute.internal            <none>           <none>

Let's monitor the nodes and these pods.

# Initial state
ec2-user:~/environment:$  while true; do date; kubectl get nodes -o wide --label-columns=eks.amazonaws.com/nodegroup,topology.kubernetes.io/zone |egrep "green|blue";echo;  kubectl get po -A -owide |grep orders; sleep 5; echo; done
Tue Apr  1 16:34:39 UTC 2025
ip-10-0-3-145.us-west-2.compute.internal            Ready    <none>   2d11h   v1.25.16-eks-59bf375   10.0.3.145    <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   blue-mng-2025033005225055020000002c    us-west-2a
ip-10-0-3-227.us-west-2.compute.internal            Ready    <none>   7m35s   v1.26.15-eks-59bf375   10.0.3.227    <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   green-mng-20250401162553095800000007   us-west-2a

orders        orders-5b97745747-7rwdl                                     1/1     Running   2 (2d11h ago)   2d11h   10.0.3.163    ip-10-0-3-145.us-west-2.compute.internal            <none>           <none>
orders        orders-mysql-b9b997d9d-bnbmn                                1/1     Running   0               2d11h   10.0.7.229    ip-10-0-3-145.us-west-2.compute.internal            <none>           <none>

# Blue node is cordoned and drained; pods move to the Green node
Tue Apr  1 16:35:04 UTC 2025
ip-10-0-3-145.us-west-2.compute.internal            Ready    <none>   2d11h   v1.25.16-eks-59bf375   10.0.3.145    <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   blue-mng-2025033005225055020000002c    us-west-2a
ip-10-0-3-227.us-west-2.compute.internal            Ready    <none>   8m      v1.26.15-eks-59bf375   10.0.3.227    <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   green-mng-20250401162553095800000007   us-west-2a

orders        orders-5b97745747-7rwdl                                     1/1     Running   2 (2d11h ago)   2d11h   10.0.3.163    ip-10-0-3-145.us-west-2.compute.internal            <none>           <none>
orders        orders-mysql-b9b997d9d-bnbmn                                1/1     Running   0               2d11h   10.0.7.229    ip-10-0-3-145.us-west-2.compute.internal            <none>           <none>

Tue Apr  1 16:35:11 UTC 2025
ip-10-0-3-145.us-west-2.compute.internal            Ready,SchedulingDisabled   <none>   2d11h   v1.25.16-eks-59bf375   10.0.3.145    <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   blue-mng-2025033005225055020000002c    us-west-2a
ip-10-0-3-227.us-west-2.compute.internal            Ready                      <none>   8m7s    v1.26.15-eks-59bf375   10.0.3.227    <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   green-mng-20250401162553095800000007   us-west-2a

orders        orders-5b97745747-7ctj4                                     0/1     ContainerCreating   0               5s      <none>        ip-10-0-3-227.us-west-2.compute.internal            <none>           <none>
orders        orders-mysql-b9b997d9d-wc9vn                                0/1     ContainerCreating   0               5s      <none>        ip-10-0-3-227.us-west-2.compute.internal            <none>           <none>

# Final state
Tue Apr  1 16:37:03 UTC 2025
ip-10-0-3-227.us-west-2.compute.internal            Ready    <none>   10m     v1.26.15-eks-59bf375   10.0.3.227    <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25   green-mng-20250401162553095800000007   us-west-2a

orders        orders-5b97745747-7ctj4                                     1/1     Running   2 (81s ago)     118s    10.0.8.104    ip-10-0-3-227.us-west-2.compute.internal            <none>           <none>
orders        orders-mysql-b9b997d9d-wc9vn                                1/1     Running   0               118s    10.0.12.108   ip-10-0-3-227.us-west-2.compute.internal            <none>           <none>

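That wraps up the managed node group lab.

3.4. Karpenter node upgrade

As the preceding lab with EBS PVs showed, managed node groups require you to consider quite a few things when creating a new node group.

Karpenter, by contrast, takes factors such as the PV location into account when it adds a new node, which makes it more convenient to use.

Karpenter offers features such as Drift and TTL (expireAfter) for upgrading nodes. They work differently, but the outcome is the same: once you change the EC2NodeClass to the AMI you want to upgrade to, nodes are either steered toward the desired spec (Drift) or replaced once their TTL has expired.

For that reason, this lab modifies the EC2NodeClass rather than Terraform.

Because CodeCommit and Argo CD are connected here, we edit the file locally, push it to CodeCommit, and then sync the karpenter application.

First, look up the AMI ID for 1.26.

## Look up the AMI ID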
ec2-user:~/environment/terraform:$  aws ssm get-parameter --name /aws/service/eks/optimized-ami/1.26/amazon-linux-2/recommended/image_id \
    --region ${AWS_REGION} --query "Parameter.Value" --output text
ami-086414611b43bb691

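Next, move to eks-gitops-repo\apps\karpenter locally and change the AMI ID in default-ec2nc.yaml to the one you just looked up. The listing below still shows the 1.25 AMI, that is, the value before the change.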

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  amiSelectorTerms:
  - id: "ami-0ee947a6f4880da75" # Latest EKS 1.25 AMI
  role: karpenter-eksworkshop-eksctl
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: eksworkshop-eksctl
  subnetSelectorTerms:
  - tags:
      karpenter.sh/discovery: eksworkshop-eksctl
  tags:
    intent: apps
    managed-by: karpenter
    team: checkout

Push the changed file to CodeCommit and sync Argo CD.

# Takes about 10 minutes (estimated), including the lab
cd ~/environment/eks-gitops-repo
git add apps/karpenter/default-ec2nc.yaml apps/karpenter/default-np.yaml
git commit -m "disruption changes"
git push --set-upstream origin main
argocd app sync karpenter

# Monitoring
while true; do date; kubectl get nodeclaim; echo ; kubectl get nodes -l team=checkout; echo ; kubectl get nodes -l team=checkout -o jsonpath="{range .items[*]}{.metadata.name} {.spec.taints}{\"\n\"}{end}"; echo ; kubectl get pods -n checkout -o wide; echo ; sleep 1; echo; done

# Initial state
ec2-user:~/environment:$ while true; do date; kubectl get nodeclaim; echo ; kubectl get nodes -l team=checkout; echo ; kubectl get nodes -l team=checkout -o jsonpath="{range .items[*]}{.metadata.name} {.spec.taints}{\"\n\"}{end}"; echo ; kubectl get pods -n checkout -o wide; echo ; sleep 1; echo; done
Tue Apr  1 16:54:08 UTC 2025
NAME            TYPE       ZONE         NODE                                        READY   AGE
default-6css4   c4.large   us-west-2b   ip-10-0-24-100.us-west-2.compute.internal   True    19h

NAME                                        STATUS   ROLES    AGE   VERSION
ip-10-0-24-100.us-west-2.compute.internal   Ready    <none>   19h   v1.25.16-eks-59bf375

ip-10-0-24-100.us-west-2.compute.internal [{"effect":"NoSchedule","key":"dedicated","value":"CheckoutApp"}]

NAME                             READY   STATUS    RESTARTS   AGE   IP            NODE                                        NOMINATED NODE   READINESS GATES
checkout-558f7777c-z5qvh         1/1     Running   0          19h   10.0.29.195   ip-10-0-24-100.us-west-2.compute.internal   <none>           <none>
checkout-redis-f54bf7cb5-r2sdp   1/1     Running   0          19h   10.0.19.67    ip-10-0-24-100.us-west-2.compute.internal   <none>           <none>

# New node being created
Tue Apr  1 16:58:58 UTC 2025
NAME            TYPE       ZONE         NODE                                        READY     AGE
default-6css4   c4.large   us-west-2b   ip-10-0-24-100.us-west-2.compute.internal   True      19h
default-pflq6   c4.large   us-west-2b                                               Unknown   3s

NAME                                        STATUS   ROLES    AGE   VERSION
ip-10-0-24-100.us-west-2.compute.internal   Ready    <none>   19h   v1.25.16-eks-59bf375

# Node Ready
Tue Apr  1 17:00:35 UTC 2025
NAME            TYPE       ZONE         NODE                                        READY   AGE
default-6css4   c4.large   us-west-2b   ip-10-0-24-100.us-west-2.compute.internal   True    19h
default-pflq6   c4.large   us-west-2b   ip-10-0-28-136.us-west-2.compute.internal   True    99s

NAME                                        STATUS   ROLES    AGE   VERSION
ip-10-0-24-100.us-west-2.compute.internal   Ready    <none>   19h   v1.25.16-eks-59bf375
ip-10-0-28-136.us-west-2.compute.internal   Ready    <none>   27s   v1.26.15-eks-59bf375

# Pods are created on the new node
Tue Apr  1 17:00:35 UTC 2025
NAME            TYPE       ZONE         NODE                                        READY   AGE
default-6css4   c4.large   us-west-2b   ip-10-0-24-100.us-west-2.compute.internal   True    19h
default-pflq6   c4.large   us-west-2b   ip-10-0-28-136.us-west-2.compute.internal   True    99s

NAME                                        STATUS   ROLES    AGE   VERSION
ip-10-0-24-100.us-west-2.compute.internal   Ready    <none>   19h   v1.25.16-eks-59bf375
ip-10-0-28-136.us-west-2.compute.internal   Ready    <none>   27s   v1.26.15-eks-59bf375

ip-10-0-24-100.us-west-2.compute.internal [{"effect":"NoSchedule","key":"dedicated","value":"CheckoutApp"},{"effect":"NoSchedule","key":"karpenter.sh/disruption","value":"disrupting"}]
ip-10-0-28-136.us-west-2.compute.internal [{"effect":"NoSchedule","key":"dedicated","value":"CheckoutApp"}]

NAME                             READY   STATUS    RESTARTS   AGE   IP            NODE                                        NOMINATED NODE   READINESS GATES
checkout-558f7777c-z5qvh         1/1     Running   0          19h   10.0.29.195   ip-10-0-24-100.us-west-2.compute.internal   <none>           <none>
checkout-redis-f54bf7cb5-r2sdp   1/1     Running   0          19h   10.0.19.67    ip-10-0-24-100.us-west-2.compute.internal   <none>           <none>


Tue Apr  1 17:00:41 UTC 2025
NAME            TYPE       ZONE         NODE                                        READY   AGE
default-6css4   c4.large   us-west-2b   ip-10-0-24-100.us-west-2.compute.internal   True    19h
default-pflq6   c4.large   us-west-2b   ip-10-0-28-136.us-west-2.compute.internal   True    105s

NAME                                        STATUS   ROLES    AGE   VERSION
ip-10-0-24-100.us-west-2.compute.internal   Ready    <none>   19h   v1.25.16-eks-59bf375
ip-10-0-28-136.us-west-2.compute.internal   Ready    <none>   33s   v1.26.15-eks-59bf375

ip-10-0-24-100.us-west-2.compute.internal [{"effect":"NoSchedule","key":"dedicated","value":"CheckoutApp"},{"effect":"NoSchedule","key":"karpenter.sh/disruption","value":"disrupting"}]
ip-10-0-28-136.us-west-2.compute.internal [{"effect":"NoSchedule","key":"dedicated","value":"CheckoutApp"}]

NAME                             READY   STATUS              RESTARTS   AGE   IP       NODE                                        NOMINATED NODE   READINESS GATES
checkout-558f7777c-hddnc         0/1     ContainerCreating   0          2s    <none>   ip-10-0-28-136.us-west-2.compute.internal   <none>           <none>
checkout-redis-f54bf7cb5-tqj6q   0/1     ContainerCreating   0          2s    <none>   ip-10-0-28-136.us-west-2.compute.internal   <none>           <none>

# Old node is gone
Tue Apr  1 17:00:46 UTC 2025
NAME            TYPE       ZONE         NODE                                        READY   AGE
default-pflq6   c4.large   us-west-2b   ip-10-0-28-136.us-west-2.compute.internal   True    109s

NAME                                        STATUS   ROLES    AGE   VERSION
ip-10-0-28-136.us-west-2.compute.internal   Ready    <none>   37s   v1.26.15-eks-59bf375

ip-10-0-28-136.us-west-2.compute.internal [{"effect":"NoSchedule","key":"dedicated","value":"CheckoutApp"}]

NAME                             READY   STATUS              RESTARTS   AGE   IP       NODE                                        NOMINATED NODE   READINESS GATES
checkout-558f7777c-hddnc         0/1     ContainerCreating   0          6s    <none>   ip-10-0-28-136.us-west-2.compute.internal   <none>           <none>
checkout-redis-f54bf7cb5-tqj6q   0/1     ContainerCreating   0          6s    <none>   ip-10-0-28-136.us-west-2.compute.internal   <none>           <none>

The detailed steps Karpenter takes can be followed in its logs.

kubectl -n karpenter logs deployment/karpenter -c controller --tail=33 -f
...

# drift triggered > nodeClaim created > nodeClaim launched
{"level":"INFO","time":"2025-04-01T16:58:57.282Z","logger":"controller","message":"disrupting via drift replace, terminating 1 nodes (2 pods) ip-10-0-24-100.us-west-2.compute.internal/c4.large/spot and replacing with node from types c5.large, c4.large, m6a.large, r4.large, m6i.large and 40 other(s)","commit":"490ef94","controller":"disruption","command-id":"3a295ac9-2a0d-4ddf-a6cb-e8d08915cff2"}
{"level":"INFO","time":"2025-04-01T16:58:57.318Z","logger":"controller","message":"created nodeclaim","commit":"490ef94","controller":"disruption","NodePool":{"name":"default"},"NodeClaim":{"name":"default-pflq6"},"requests":{"cpu":"430m","memory":"632Mi","pods":"6"},"instance-types":"c4.2xlarge, c4.4xlarge, c4.8xlarge, c4.large, c4.xlarge and 40 other(s)"}
{"level":"INFO","time":"2025-04-01T16:59:00.052Z","logger":"controller","message":"launched nodeclaim","commit":"490ef94","controller":"nodeclaim.lifecycle","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"default-pflq6"},"namespace":"","name":"default-pflq6","reconcileID":"a314c3cc-925a-4039-a313-b10e3d762fed","provider-id":"aws:///us-west-2b/i-05f149a8dcf7d844d","instance-type":"c4.large","zone":"us-west-2b","capacity-type":"spot","allocatable":{"cpu":"1930m","ephemeral-storage":"17Gi","memory":"2878Mi","pods":"29"}}

# node registered > initialized
{"level":"INFO","time":"2025-04-01T17:00:10.779Z","logger":"controller","message":"registered nodeclaim","commit":"490ef94","controller":"nodeclaim.lifecycle","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"default-pflq6"},"namespace":"","name":"default-pflq6","reconcileID":"5a16a55e-9a6f-434f-b59e-c7daf0a93bf3","provider-id":"aws:///us-west-2b/i-05f149a8dcf7d844d","Node":{"name":"ip-10-0-28-136.us-west-2.compute.internal"}}
{"level":"INFO","time":"2025-04-01T17:00:33.021Z","logger":"controller","message":"initialized nodeclaim","commit":"490ef94","controller":"nodeclaim.lifecycle","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"default-pflq6"},"namespace":"","name":"default-pflq6","reconcileID":"c4e40956-a02a-4337-bd69-6b9be1d72d5f","provider-id":"aws:///us-west-2b/i-05f149a8dcf7d844d","Node":{"name":"ip-10-0-28-136.us-west-2.compute.internal"},"allocatable":{"cpu":"1930m","ephemeral-storage":"18242267924","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"3119300Ki","pods":"29"}}
{"level":"INFO","time":"2025-04-01T17:00:42.921Z","logger":"controller","message":"command succeeded","commit":"490ef94","controller":"disruption.queue","command-id":"3a295ac9-2a0d-4ddf-a6cb-e8d08915cff2"}

# node deleted
{"level":"INFO","time":"2025-04-01T17:00:42.963Z","logger":"controller","message":"tainted node","commit":"490ef94","controller":"node.termination","controllerGroup":"","controllerKind":"Node","Node":{"name":"ip-10-0-24-100.us-west-2.compute.internal"},"namespace":"","name":"ip-10-0-24-100.us-west-2.compute.internal","reconcileID":"73867503-c843-4373-b03e-a3406d6f60b3"}
{"level":"INFO","time":"2025-04-01T17:00:45.441Z","logger":"controller","message":"deleted node","commit":"490ef94","controller":"node.termination","controllerGroup":"","controllerKind":"Node","Node":{"name":"ip-10-0-24-100.us-west-2.compute.internal"},"namespace":"","name":"ip-10-0-24-100.us-west-2.compute.internal","reconcileID":"03122384-44e7-4a0b-b28b-372ec6e10f1b"}
{"level":"INFO","time":"2025-04-01T17:00:45.808Z","logger":"controller","message":"deleted nodeclaim","commit":"490ef94","controller":"nodeclaim.termination","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"default-6css4"},"namespace":"","name":"default-6css4","reconcileID":"85396b01-d267-4a9a-b995-fcabaaf5e423","Node":{"name":"ip-10-0-24-100.us-west-2.compute.internal"},"provider-id":"aws:///us-west-2b/i-0cc37fd17692cedac"}

The upgrade via Karpenter's Drift mechanism is now complete.
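
The JSON log lines above are long, so a condensed timeline of time and message can be easier to follow. The filter below is a rough sketch using only sed. karpenter_events is a hypothetical helper name, and it assumes each line carries "time" before "message" and that the message contains no escaped quotes. Where a JSON-aware tool is available, that would be the more robust choice.

```shell
# Prints "<time> <message>" for each Karpenter JSON log line read on stdin.
# karpenter_events is a hypothetical helper; lines that do not match are skipped.
karpenter_events() {
  sed -n 's/.*"time":"\([^"]*\)".*"message":"\([^"]*\)".*/\1 \2/p'
}
```

For example: kubectl -n karpenter logs deployment/karpenter -c controller --tail=33 | karpenter_events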

3.5. Self-managed node upgrade

For self-managed nodes you have to update the AMI yourself; the upgrade is performed by updating the node group with the new AMI ID.

In base.tf, change the ami_id of the self-managed node group and apply it with terraform apply -auto-approve.

self_managed_node_groups = {
  self-managed-group = {
    instance_type = "m5.large"

...

    # Additional configurations
    ami_id           = "ami-086414611b43bb691" # Replaced the latest AMI ID for EKS 1.26
    subnet_ids       = module.vpc.private_subnets
    .
    .
    .
    launch_template_use_name_prefix = true
  }
}

Afterwards you can confirm that the nodes have changed. Note, however, that although terraform apply appears to have finished, the nodes actually take a little longer to be replaced. For managed node groups the end of terraform apply coincides with the end of the upgrade, whereas for self-managed nodes it does not.
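
Since terraform apply returns before the self-managed nodes are actually replaced, one option is to poll the node versions afterwards. This is a minimal sketch, assuming the default kubectl get nodes --no-headers layout in which VERSION is the fifth column; all_nodes_at_version is a hypothetical helper name.

```shell
# Reads `kubectl get nodes --no-headers` rows on stdin (VERSION is column 5).
# Succeeds only if at least one node is listed and every node's version
# starts with the given prefix, e.g. v1.26.
# all_nodes_at_version is a hypothetical helper, not part of the workshop.
all_nodes_at_version() {
  awk -v want="$1" '
    NF { total++; if (index($5, want) == 1) ok++ }
    END { exit !(total > 0 && ok + 0 == total) }
  '
}
```

It could be wrapped in a loop such as: until kubectl get nodes --no-headers | all_nodes_at_version v1.26; do sleep 10; done. In practice you would narrow the node list with a label selector of your own so that only the self-managed nodes are checked.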

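3.6. Fargate node upgrade

With Fargate you don't need to provision or manage groups of virtual machines yourself. To upgrade, you simply restart the pods so that the Fargate controller schedules them onto the latest Kubernetes version.

Proceed as follows.

# Initial state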
ec2-user:~/environment:$ kubectl get pods -n assets -o wide
NAME                      READY   STATUS    RESTARTS   AGE     IP            NODE                                                NOMINATED NODE   READINESS GATES
assets-7ccc84cb4d-2p284   1/1     Running   0          2d11h   10.0.37.152   fargate-ip-10-0-37-152.us-west-2.compute.internal   <none>           <none>
ec2-user:~/environment:$ kubectl get node $(kubectl get pods -n assets -o jsonpath='{.items[0].spec.nodeName}') -o wide
NAME                                                STATUS   ROLES    AGE     VERSION                INTERNAL-IP   EXTERNAL-IP   OS-IMAGE         KERNEL-VERSION                  CONTAINER-RUNTIME
fargate-ip-10-0-37-152.us-west-2.compute.internal   Ready    <none>   2d11h   v1.25.16-eks-2d5f260   10.0.37.152   <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25

# Restart the deployment
ec2-user:~/environment:$ kubectl rollout restart deployment assets -n assets
deployment.apps/assets restarted

Once the new pod is Running, you can see that its node has also changed to 1.26.15.

ec2-user:~/environment:$ kubectl get pods -n assets -o wide
NAME                      READY   STATUS    RESTARTS   AGE   IP           NODE                                               NOMINATED NODE   READINESS GATES
assets-66c4799cfc-4s7s6   1/1     Running   0          78s   10.0.28.67   fargate-ip-10-0-28-67.us-west-2.compute.internal   <none>           <none>
ec2-user:~/environment:$ kubectl get node $(kubectl get pods -n assets -o jsonpath='{.items[0].spec.nodeName}') -o wide
NAME                                               STATUS   ROLES    AGE   VERSION                INTERNAL-IP   EXTERNAL-IP   OS-IMAGE         KERNEL-VERSION                  CONTAINER-RUNTIME
fargate-ip-10-0-28-67.us-west-2.compute.internal   Ready    <none>   33s   v1.26.15-eks-2d5f260   10.0.28.67    <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25

That covers the In-place cluster upgrade.

All nodes have now been upgraded to 1.26.15.

ec2-user:~/environment:$ kubectl get no 
NAME                                               STATUS   ROLES    AGE    VERSION
fargate-ip-10-0-28-67.us-west-2.compute.internal   Ready    <none>   10m    v1.26.15-eks-2d5f260
ip-10-0-28-136.us-west-2.compute.internal          Ready    <none>   27m    v1.26.15-eks-59bf375
ip-10-0-28-191.us-west-2.compute.internal          Ready    <none>   96m    v1.26.15-eks-59bf375
ip-10-0-3-227.us-west-2.compute.internal           Ready    <none>   60m    v1.26.15-eks-59bf375
ip-10-0-31-15.us-west-2.compute.internal           Ready    <none>   7m9s   v1.26.15-eks-59bf375
ip-10-0-35-1.us-west-2.compute.internal            Ready    <none>   12m    v1.26.15-eks-59bf375
ip-10-0-46-150.us-west-2.compute.internal          Ready    <none>   95m    v1.26.15-eks-59bf375

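4. Blue/Green cluster upgrade

A Blue/Green cluster upgrade follows the same idea as the Blue/Green upgrade of managed node groups covered earlier. You create a new Green cluster and then route traffic to it to complete the upgrade.

The key difference from In-place is that, because the Green cluster is a separate cluster, you can move to the target version in a single step. Keeping the existing cluster around also makes rollback straightforward. The downsides are the extra cost of running two clusters at the same time, and the complexity of migrating stateful workloads and routing traffic to the new cluster.

The lab itself is not difficult, so I'll only give a rough outline of the workshop.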

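1) Create the Green cluster: create and deploy it with Terraform code. The Kubernetes version and the matching add-on versions need to be settled beforehand.

2) Migrate stateless workloads: deploy the stateless applications to the new cluster. Check first that the target version does not remove any deprecated APIs you rely on, and change the manifests in advance if it does.

3) Migrate stateful workloads: stateful workloads raise data synchronization issues. You need to synchronize storage or data beforehand so that the new cluster ends up in the same state. This part is not simple and needs careful thought.

4) Switch traffic: once the new cluster is fully set up, route traffic to Green.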

That concludes this post on EKS Upgrade.

CI/CD on Kubernetes with Jenkins and Argo CD

한명 2025. 3. 30. 04:37

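In this post we'll walk through an example of setting up CI (Continuous Integration)/CD (Continuous Deployment) for application deployment on Kubernetes.

In a Kubernetes environment, once source code lands in the code repository, a container image is built from it, and the workload is then deployed to Kubernetes from the new image.

Doing each of these steps by hand on a developer's PC is hard, and with frequent code changes the repetitive work can end up consuming more time on deployment than on application development. Worse, manual intervention at each step invites mistakes, and a deployment problem caused by one can turn into a service incident.

I hope that by setting up CI with Jenkins and CD with Argo CD in this post, you'll pick up some ideas on how to make the CI/CD process concise and automated.

1. Lab environment setup

The lab environment aims to complete the following flow.

When a developer pushes code to the development team's repository, Jenkins, acting as CI, fetches the code, builds a container image, pushes it to the container registry (Docker Hub), and records the change in the DevOps team's source repository.

Argo CD, which handles CD, picks up the changes in the DevOps team's repository, checks whether the target Kubernetes cluster matches the desired state, and syncs it. At that point the newly uploaded container image is pulled and the new deployment rolls out.

The components used in this lab are as follows.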

SCM(Source Code Management): Gogs(https://gogs.io/)
Container image registry: Docker Hub(https://hub.docker.com/)
CI(Continuous Integration): Jenkins(https://www.jenkins.io/)
CD(Continuous Deployment): Argo CD(https://argo-cd.readthedocs.io/)

실습에서 쿠버네티스 환경은 kind를 통하여 간단하게 구성하도록 하겠습니다.

Gogs, Jenkins은 docker compose로 구성하고, Argo CD는 쿠버네티스에 in-cluster 방식으로 설치하겠습니다.

Docker Hub Setup

First, create a dev-app repository on Docker Hub, which will serve as the container image store, and issue a token.

Create the token from the account menu in the upper right, under Account settings > Personal access tokens.

Select Read, Write, Delete for Access permissions.

Make a note of the generated token.

Kubernetes Environment Setup

For this lab, we'll set up WSL on Windows and create a Kubernetes cluster with kind.

# Prepare a working directory before deploying the cluster
mkdir -p ~/cicd-labs
cd ~/cicd-labs

# Look up the WSL2 eth0 IP
ip -br -c a

MyIP=<your WSL2 eth0 IP>
MyIP=172.28.157.42

# Write the file below in the cicd-labs directory
cat > kind-3node.yaml <<EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  apiServerAddress: "$MyIP"
nodes:
- role: control-plane
  extraPortMappings:
  - containerPort: 30000
    hostPort: 30000
  - containerPort: 30001
    hostPort: 30001
  - containerPort: 30002
    hostPort: 30002
  - containerPort: 30003
    hostPort: 30003
- role: worker
- role: worker
EOF
kind create cluster --config kind-3node.yaml --name myk8s --image kindest/node:v1.32.2

# Verify
kind get nodes --name myk8s
myk8s-control-plane
myk8s-worker
myk8s-worker2

kubectl get no
NAME                  STATUS     ROLES           AGE   VERSION
myk8s-control-plane   Ready      control-plane   36s   v1.32.2
myk8s-worker          NotReady   <none>          19s   v1.32.2
myk8s-worker2         NotReady   <none>          19s   v1.32.2


# Check the k8s API address
kubectl cluster-info
Kubernetes control plane is running at https://172.28.157.42:33215
CoreDNS is running at https://172.28.157.42:33215/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
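The extraPortMappings section in kind-3node.yaml above forwards host ports 30000 to 30003 into the control-plane node container, which is what later lets NodePort services be reached at 127.0.0.1. If you need a different set of ports, the fragment can be generated rather than typed by hand. This is a minimal sketch with a hypothetical helper name; it only builds text and does not call kind.

```python
def port_mappings_yaml(ports, indent="  "):
    """Build the extraPortMappings fragment for a kind node entry."""
    lines = [f"{indent}extraPortMappings:"]
    for port in ports:
        # The same number is used on both sides, as in the config above.
        lines.append(f"{indent}- containerPort: {port}")
        lines.append(f"{indent}  hostPort: {port}")
    return "\n".join(lines)


print(port_mappings_yaml(range(30000, 30004)))
```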

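To pull images from the Docker Hub repository created earlier, create a docker-registry secret in this Kubernetes environment.

# k8s secret: set up Docker credentials
kubectl get secret -A  # the secret type is specified at creation time

DHUSER=<your Docker Hub account>
DHPASS=<your Docker Hub password or token>
echo $DHUSER $DHPASS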

kubectl create secret docker-registry dockerhub-secret \
  --docker-server=https://index.docker.io/v1/ \
  --docker-username=$DHUSER \
  --docker-password=$DHPASS

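Installing Gogs and Jenkins

When you create a cluster with kind as above, a bridge network named kind is created in Docker.

We use this kind network to install Gogs and Jenkins as shown below. Note that in this lab Kubernetes, Gogs, and Jenkins all use the kind network, and the exposed IP is set to the WSL eth0 address so that they can reach each other.

# Move to the working directory
cd ~/cicd-labs

# Create the kind cluster first so that the docker network (kind) exists, then create Jenkins and Gogs below.
# Check docker networks: kind is used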
docker network ls
...
d91da96d2114   kind      bridge    local
...

# Write docker-compose.yaml
cat <<EOT > docker-compose.yaml
services:

  jenkins:
    container_name: jenkins
    image: jenkins/jenkins
    restart: unless-stopped
    networks:
      - kind
    ports:
      - "8080:8080"
      - "50000:50000"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - jenkins_home:/var/jenkins_home

  gogs:
    container_name: gogs
    image: gogs/gogs
    restart: unless-stopped
    networks:
      - kind
    ports:
      - "10022:22"
      - "3000:3000"
    volumes:
      - gogs-data:/data

volumes:
  jenkins_home:
  gogs-data:

networks:
  kind:
    external: true
EOT


# Deploy
docker compose up -d
[+] Running 21/21
 ✔ gogs Pulled                                                                                                                                                                        12.3s
 ✔ jenkins Pulled                                                                                                                                                                     33.5s

[+] Running 4/4
 ✔ Volume "cicd-labs_jenkins_home"  Created                                                                                                                                            0.0s
 ✔ Volume "cicd-labs_gogs-data"     Created                                                                                                                                            0.0s
 ✔ Container gogs                   Started                                                                                                                                            2.8s
 ✔ Container jenkins                Started

docker compose ps
NAME      IMAGE             COMMAND                  SERVICE   CREATED          STATUS                    PORTS
gogs      gogs/gogs         "/app/gogs/docker/st…"   gogs      58 seconds ago   Up 55 seconds (healthy)   0.0.0.0:3000->3000/tcp, :::3000->3000/tcp, 0.0.0.0:10022->22/tcp, [::]:10022->22/tcp
jenkins   jenkins/jenkins   "/usr/bin/tini -- /u…"   jenkins   58 seconds ago   Up 55 seconds             0.0.0.0:8080->8080/tcp, :::8080->8080/tcp, 0.0.0.0:50000->50000/tcp, :::50000->50000/tcp

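Gogs Initial Setup

To initialize the installed Gogs, open the following address in a browser.

http://127.0.0.1:3000/install

The initial setup screen is displayed.

I configured the following values during initial setup.

Database: SQLite3
Application URL: http://<WSL eth0 IP>:3000/
Default Branch: main
Admin Account Settings: enter Username, Password, Admin Email

Then run the installation with [Gogs 설치하기] (Install Gogs), and once the screen changes, sign in with the administrator account.

Next, create the development team repository and the DevOps team repository described earlier, then generate the token that will be used for authentication. (Some of the translated UI text can be odd, so I recommend switching the language to English at the bottom of the page before continuing.)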

Creating the Repositories

We'll create the two repositories needed for the lab.

[Development team repository]

Repository Name: dev-app
Visibility: Private
.gitignore: Python
Readme: leave as Default and check initialize this repository with selected files and template

[DevOps team repository]

Repository Name: ops-deploy
Visibility: Private
.gitignore: Python
Readme: leave as Default and check initialize this repository with selected files and template

Once creation is complete, you have the environment shown below.

Creating a Token

To access Gogs with git from the local PC, we'll create a token.

Click the account menu in the upper right of Gogs, go to Your Settings > Application > Generate New Token, create a token as shown below, and note the generated token value.

Now let's write some starter code and push it to the development team repository (dev-app).

TOKEN=<your Gogs token>
TOKEN=8cdf5569aedd230503abea67b0794b4d1e931c10 

MyIP=<your PC IP> # Windows (WSL2) users: enter the eth0 IP of your WSL2 Ubuntu!
MyIP=172.28.157.42

git clone http://devops:$TOKEN@$MyIP:3000/devops/dev-app.git
Cloning into 'dev-app'...
...

cd dev-app

# Initial git configuration
git --no-pager config --local --list
git config --local user.name "devops"
git config --local user.email "a@a.com"
git config --local init.defaultBranch main
git config --local credential.helper store
git --no-pager config --local --list

# Check repository information
git --no-pager branch
* main
git remote -v
origin  http://devops:8cdf5569aedd230503abea67b0794b4d1e931c10@172.28.157.42:3000/devops/dev-app.git (fetch)
origin  http://devops:8cdf5569aedd230503abea67b0794b4d1e931c10@172.28.157.42:3000/devops/dev-app.git (push)

# Write server.py
cat > server.py <<EOF
from http.server import ThreadingHTTPServer, BaseHTTPRequestHandler
from datetime import datetime
import socket

class RequestHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        match self.path:
            case '/':
                now = datetime.now()
                hostname = socket.gethostname()
                response_string = now.strftime("The time is %-I:%M:%S %p, VERSION 0.0.1\n")
                response_string += f"Server hostname: {hostname}\n"                
                self.respond_with(200, response_string)
            case '/healthz':
                self.respond_with(200, "Healthy")
            case _:
                self.respond_with(404, "Not Found")

    def respond_with(self, status_code: int, content: str) -> None:
        self.send_response(status_code)
        self.send_header('Content-type', 'text/plain')
        self.end_headers()
        self.wfile.write(bytes(content, "utf-8")) 

def startServer():
    try:
        server = ThreadingHTTPServer(('', 80), RequestHandler)
        print("Listening on " + ":".join(map(str, server.server_address)))
        server.serve_forever()
    except KeyboardInterrupt:
        server.shutdown()

if __name__== "__main__":
    startServer()
EOF


# (Note) Check that it runs with python: a simple web server that responds on / and /healthz as shown below
python3 server.py
Listening on 0.0.0.0:80
127.0.0.1 - - [29/Mar/2025 23:56:19] "GET / HTTP/1.1" 200 -
127.0.0.1 - - [29/Mar/2025 23:56:27] "GET /healthz HTTP/1.1" 200 -


# Create the Dockerfile
cat > Dockerfile <<EOF
FROM python:3.12
ENV PYTHONUNBUFFERED=1
COPY . /app
WORKDIR /app
CMD ["python3", "server.py"]
EOF


# Create the VERSION file
echo "0.0.1" > VERSION

# Check the resulting files
tree
.
├── Dockerfile
├── README.md
├── VERSION
└── server.py

# Push to the remote
git status
git add .
git commit -m "Add dev-app"
git push -u origin main
...

When you're done, you can confirm that the files have been pushed to the repository, as shown below.
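Before building the image, it can help to exercise the three routes of server.py (/, /healthz, and anything else) without binding port 80, which usually requires elevated privileges outside a container. The sketch below is a stand-in, not the post's server.py: `Handler` is a simplified hypothetical copy of the routing logic, and the server listens on an ephemeral local port so the check can run anywhere.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class Handler(BaseHTTPRequestHandler):
    """Simplified stand-in that mirrors the routes of server.py."""

    def do_GET(self):
        if self.path == "/":
            self._reply(200, "VERSION 0.0.1\n")
        elif self.path == "/healthz":
            self._reply(200, "Healthy")
        else:
            self._reply(404, "Not Found")

    def _reply(self, status, body):
        self.send_response(status)
        self.send_header("Content-type", "text/plain")
        self.end_headers()
        self.wfile.write(body.encode("utf-8"))

    def log_message(self, *args):
        pass  # keep the check quiet


# Opener that ignores any proxy settings so localhost is reached directly.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def fetch(url):
    try:
        with _opener.open(url, timeout=5) as resp:
            return resp.status, resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        return err.code, err.read().decode("utf-8")


def check_routes():
    server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)  # port 0 = ephemeral port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = f"http://127.0.0.1:{server.server_address[1]}"
    try:
        return {path: fetch(base + path) for path in ("/", "/healthz", "/missing")}
    finally:
        server.shutdown()
        server.server_close()


print(check_routes())
```

The /healthz route matters later because the Deployment's livenessProbe calls it.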

Jenkins Initial Setup

Check the initial password in the Jenkins container and sign in.

# Move to the working directory
cd ~/cicd-labs

# Check the initial Jenkins password
docker compose exec jenkins cat /var/jenkins_home/secrets/initialAdminPassword
cf7605f7b5ff45349b65e9fc682ab5ca

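# Open the Jenkins web UI > enter the initial password > install plugins > enter the admin account details
# > enter the WSL eth0 address for the Jenkins URL
# Open http://127.0.0.1:8080 in a web browser

# (Note) Check logs: follow the plugin installation progress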
docker compose logs jenkins -f

Because Jenkins runs docker build in this lab, we'll install docker inside the Jenkins container.

# Install the Docker CLI inside the Jenkins container
docker compose exec --privileged -u root jenkins bash
-----------------------------------------------------
id

# Make sure the keyring directory exists before downloading Docker's GPG key
install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/debian/gpg -o /etc/apt/keyrings/docker.asc
chmod a+r /etc/apt/keyrings/docker.asc
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/debian \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  tee /etc/apt/sources.list.d/docker.list > /dev/null
apt-get update && apt-get install -y docker-ce-cli curl tree jq yq

# Verify (the commands below should run without errors)
docker info
docker ps
which docker

# Grant permission so that the non-root jenkins user inside the Jenkins container can also run docker
groupadd -g 1001 -f docker  # Windows WSL2 (container): use the docker group ID shown in cat /etc/group

chgrp docker /var/run/docker.sock
ls -l /var/run/docker.sock
usermod -aG docker jenkins
cat /etc/group | grep docker
docker:x:1001:jenkins

exit
--------------------------------------------

# Restart the Jenkins container so that the settings above also apply to the Jenkins app
docker compose restart jenkins
[+] Restarting 1/1
 ✔ Container jenkins  Started     

# Check that docker commands run as the jenkins user
docker compose exec jenkins id
uid=1000(jenkins) gid=1000(jenkins) groups=1000(jenkins),1001(docker)
docker compose exec jenkins docker info
docker compose exec jenkins docker ps

Next, open the Jenkins web UI and install the plugins used in this lab.

Go to Manage Jenkins > Plugins in the left menu and install Pipeline Stage View, Docker Pipeline, and Gogs.

For example, search for each one under Available plugins as shown below, select it, and click Install.

Finally, set up credentials in Jenkins. These credentials hold the authentication information Jenkins uses to access Gogs, Docker Hub, and Kubernetes.

Go back to Manage Jenkins > Credentials in the left menu, and select global under Domains.

Create each credential with Add Credentials.

1) Gogs credential (gogs-crd)

Kind : Username with password
Username : devops
Password : <token>
ID : gogs-crd

2) Docker Hub credential (dockerhub-crd)

Kind : Username with password
Username : <Docker account name>
Password : <token>
ID : dockerhub-crd

3) Kubernetes (kind) credential (k8s-crd)

Kind : Secret file
File : <kubeconfig file>
ID : k8s-crd

In the end, the credentials are created as shown below.

Argo CD Initial Setup

Deploy Argo CD to the kind cluster as follows.

# Create the namespace and write the values file
cd ~/cicd-labs

kubectl create ns argocd
cat <<EOF > argocd-values.yaml
dex:
  enabled: false

server:
  service:
    type: NodePort
    nodePortHttps: 30002
  extraArgs:
    - --insecure  # serve HTTP instead of HTTPS
EOF

# Install
helm repo add argo https://argoproj.github.io/argo-helm
helm install argocd argo/argo-cd --version 7.8.13 -f argocd-values.yaml --namespace argocd # 7.7.10

# Verify
kubectl get pod,svc -n argocd
NAME                                                   READY   STATUS    RESTARTS   AGE
pod/argocd-application-controller-0                    1/1     Running   0          3m51s
pod/argocd-applicationset-controller-cccb64dc8-wsd7w   1/1     Running   0          3m51s
pod/argocd-notifications-controller-7cd4d88cd4-4s789   1/1     Running   0          3m51s
pod/argocd-redis-6c5698fc46-njwtf                      1/1     Running   0          3m51s
pod/argocd-repo-server-5f6c4f4cf4-d4twk                1/1     Running   0          3m51s
pod/argocd-server-7cb958f5fb-str77                     1/1     Running   0          3m51s

NAME                                       TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                      AGE
service/argocd-applicationset-controller   ClusterIP   10.96.90.65    <none>        7000/TCP                     3m52s
service/argocd-redis                       ClusterIP   10.96.155.55   <none>        6379/TCP                     3m52s
service/argocd-repo-server                 ClusterIP   10.96.3.228    <none>        8081/TCP                     3m52s
service/argocd-server                      NodePort    10.96.84.116   <none>        80:30080/TCP,443:30002/TCP   3m52s

kubectl get crd | grep argo
applications.argoproj.io      2025-03-29T15:40:20Z
applicationsets.argoproj.io   2025-03-29T15:40:20Z
appprojects.argoproj.io       2025-03-29T15:40:20Z

# Check the initial login password
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d ;echo
LM991sRTVwk7xTPu

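Open http://127.0.0.1:30002 in a web browser and log in as admin with the password you just retrieved.

After logging in, go to Settings > Repositories and use CONNECT REPO to connect the Gogs ops-deploy repository created earlier.

Enter the connection details as shown below. For the password, use the token generated in Gogs earlier.

The connection should complete successfully, as shown below.

So far we have created the repository on Docker Hub, then installed Kubernetes, Gogs, Jenkins, and Argo CD and completed their initial setup.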

2. Setting Up CI with Jenkins

Let's go over some basic Jenkins terminology using the Jenkins screen below.

You'll see the terms Item and build.

In Jenkins, the basic unit of work is called an Item. It is also referred to as a Project, Job, or Pipeline.

An Item contains the following instructions.

1) Trigger: when the work runs (specifies when the task starts)

2) Build step: the tasks that make up the work, organized as individual steps

3) Post-build action: commands to run after the task completes

Jenkins uses the term 'build' for a specific run of a job. When a job runs multiple times, each run is assigned a unique build number. Details tied to that run, such as artifacts produced during execution and console logs, are stored under that build number.

Creating an Item That Builds Manually

To run CI in Jenkins, click [+ 새로운 Item] (New Item) in the web UI and create an Item.

Item Name: pipeline-ci
Item type: Pipeline
pipeline script:

pipeline {
    agent any
    environment {
        DOCKER_IMAGE = '<your Docker Hub account>/dev-app' // Docker image name
    }
    stages {
        stage('Checkout') {
            steps {
                 git branch: 'main',
                 url: 'http://<your host IP>:3000/devops/dev-app.git',  // check out the code from Git
                 credentialsId: 'gogs-crd'  // Credentials ID
            }
        }
        stage('Read VERSION') {
            steps {
                script {
                    // Read the VERSION file
                    def version = readFile('VERSION').trim()
                    echo "Version found: ${version}"
                    // Set the environment variable
                    env.DOCKER_TAG = version
                }
            }
        }
        stage('Docker Build and Push') {
            steps {
                script {
                    docker.withRegistry('https://index.docker.io/v1/', 'dockerhub-crd') {
                        // Use DOCKER_TAG
                        def appImage = docker.build("${DOCKER_IMAGE}:${DOCKER_TAG}")
                        appImage.push()
                        appImage.push("latest")
                    }
                }
            }
        }
    }
    post {
        success {
            echo "Docker image ${DOCKER_IMAGE}:${DOCKER_TAG} has been built and pushed successfully!"
        }
        failure {
            echo "Pipeline failed. Please check the logs."
        }
    }
}

The script is divided into three main parts: environment, stages, and post. environment defines environment variables, and post defines the Post-build action described earlier.

The stages run in the order Checkout > Read VERSION > Docker Build and Push: the pipeline checks out the specified git repository, reads the VERSION file and uses its content as DOCKER_TAG, and finally runs docker build and push.

You can also see that the credentials created during initial setup are referenced by credentialsId.
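The tagging logic of the Read VERSION and Docker Build and Push stages boils down to a few lines. The sketch below restates it in Python purely as an illustration; the function name and the sample account `example` are hypothetical, and the pipeline itself does this in Groovy.

```python
def image_references(image: str, version_file_text: str) -> list[str]:
    """Return the image references the pipeline pushes for one build."""
    version = version_file_text.strip()  # same idea as readFile('VERSION').trim()
    if not version:
        raise ValueError("VERSION file is empty")
    # One immutable version tag plus the moving 'latest' tag.
    return [f"{image}:{version}", f"{image}:latest"]


print(image_references("example/dev-app", "0.0.1\n"))
```

Because the version tag comes straight from the VERSION file, rebuilding without changing that file pushes the same tag again and overwrites the previous image.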

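Edit the script above for your environment, paste it into Pipeline script in the Pipeline section at the bottom, and save.

On the resulting Item, running [지금 빌드] (Build Now) runs the Pipeline manually.

The newly uploaded image also shows up on Docker Hub.

As a test, create a Deployment and check that the pods run normally.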

cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: timeserver
spec:
  replicas: 2
  selector:
    matchLabels:
      pod: timeserver-pod
  template:
    metadata:
      labels:
        pod: timeserver-pod
    spec:
      containers:
      - name: timeserver-container
        image: docker.io/$DHUSER/dev-app:0.0.1
        livenessProbe:
          initialDelaySeconds: 30
          periodSeconds: 30
          httpGet:
            path: /healthz
            port: 80
            scheme: HTTP
          timeoutSeconds: 5
          failureThreshold: 3
          successThreshold: 1
      imagePullSecrets:
      - name: dockerhub-secret
EOF

kubectl get po -w
NAME                          READY   STATUS              RESTARTS   AGE
timeserver-565559b4bf-pbd2q   0/1     ContainerCreating   0          14s
timeserver-565559b4bf-sd76g   0/1     ContainerCreating   0          14s
timeserver-565559b4bf-pbd2q   1/1     Running             0          62s
timeserver-565559b4bf-sd76g   1/1     Running             0          63s

kubectl get po -owide
NAME                          READY   STATUS    RESTARTS   AGE    IP           NODE            NOMINATED NODE   READINESS GATES
timeserver-565559b4bf-pbd2q   1/1     Running   0          110s   10.244.1.5   myk8s-worker2   <none>           <none>
timeserver-565559b4bf-sd76g   1/1     Running   0          110s   10.244.2.6   myk8s-worker    <none>           <none>


# Connectivity test
kubectl run curl-pod --image=curlimages/curl:latest --command -- sh -c "while true; do sleep 3600; done"

kubectl exec -it curl-pod -- curl 10.244.1.5
The time is 4:57:46 PM, VERSION 0.0.1
Server hostname: timeserver-565559b4bf-pbd2q

The image produced by the manual build works as expected.

Creating an Item That Builds Automatically

Now we'll make builds run automatically: when a change is pushed to the development team repository in Gogs, a webhook triggers the Jenkins Item.

First, create a new Item in Jenkins and configure it as follows.

Item name: SCM-Pipeline
Item type: Pipeline
GitHub project: http://172.28.157.42:3000/devops/dev-app (the Gogs repository)
Gogs Webhook > Use Gogs secret: set any value
Triggers > check Build when a change is pushed to Gogs
Pipeline: set to Pipeline script from SCM so that the Jenkinsfile in the repository is used.

Note that the following step is only needed because of how this lab environment is set up: so that Gogs can send webhooks to the same IP, configure it as follows.

# Edit /data/gogs/conf/app.ini in the gogs container
[security]
INSTALL_LOCK = true
SECRET_KEY   = j2xaUPQcbAEwpIu
LOCAL_NETWORK_ALLOWLIST = 172.28.157.42 # WSL2 Ubuntu eth0 IP

Unfortunately, the gogs image does not include a shell, so this is hard to do with docker exec. Edit the file through the Docker extension in vscode instead.

Then restart gogs with the command below.

docker compose restart gogs

Now add a webhook under Settings > Webhooks in Gogs as follows.

Payload URL: http://**:8080/gogs-webhook/?job=SCM-Pipeline/
Secret: set any value (the same secret must be set in both Gogs and Jenkins so that they trust each other)
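The reason both sides need the same secret is that a shared secret lets the receiver verify who sent a payload. A common scheme, and the one Gogs documents for its X-Gogs-Signature header, is an HMAC-SHA256 hex digest of the request body keyed with the secret. The sketch below shows that idea in isolation. The function names are hypothetical, and how the Jenkins Gogs plugin performs its check internally is an assumption here, not something this post confirms.

```python
import hashlib
import hmac


def sign_payload(secret: str, payload: bytes) -> str:
    """Sender side: HMAC-SHA256 hex digest of the body, keyed with the shared secret."""
    return hmac.new(secret.encode("utf-8"), payload, hashlib.sha256).hexdigest()


def is_trusted(secret: str, payload: bytes, received_signature: str) -> bool:
    """Receiver side: recompute the digest and compare in constant time."""
    return hmac.compare_digest(sign_payload(secret, payload), received_signature)


body = b'{"ref": "refs/heads/main"}'      # hypothetical webhook body
signature = sign_payload("my-shared-secret", body)

print(is_trusted("my-shared-secret", body, signature))  # same secret on both sides
print(is_trusted("another-secret", body, signature))    # mismatched secret
```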

Finally, let's actually write a Jenkinsfile in the repository, push it, and see whether the pipeline runs through the configured webhook.

# Create an empty Jenkinsfile
touch Jenkinsfile

# VERSION file: change to 0.0.3
# server.py file: change to 0.0.3

Write the following in the Jenkinsfile.

pipeline {
    agent any
    environment {
        DOCKER_IMAGE = '<your Docker Hub account>/dev-app' // Docker image name
    }
    stages {
        stage('Checkout') {
            steps {
                 git branch: 'main',
                 url: 'http://<your host IP>:3000/devops/dev-app.git',  // check out the code from Git
                 credentialsId: 'gogs-crd'  // Credentials ID
            }
        }
        stage('Read VERSION') {
            steps {
                script {
                    // Read the VERSION file
                    def version = readFile('VERSION').trim()
                    echo "Version found: ${version}"
                    // Set the environment variable
                    env.DOCKER_TAG = version
                }
            }
        }
        stage('Docker Build and Push') {
            steps {
                script {
                    docker.withRegistry('https://index.docker.io/v1/', 'dockerhub-crd') {
                        // Use DOCKER_TAG
                        def appImage = docker.build("${DOCKER_IMAGE}:${DOCKER_TAG}")
                        appImage.push()
                        appImage.push("latest")
                    }
                }
            }
        }
    }
    post {
        success {
            echo "Docker image ${DOCKER_IMAGE}:${DOCKER_TAG} has been built and pushed successfully!"
        }
        failure {
            echo "Pipeline failed. Please check the logs."
        }
    }
}

Push the file and check that the Job runs.

git add . && git commit -m "VERSION $(cat VERSION) Changed" && git push -u origin main

The job runs successfully.

The built image is uploaded as well.

With this, Jenkins is configured so that CI runs automatically whenever a change is made to the development team repository in Gogs.

The dev-app file structure currently looks like this.

tree
.
├── Dockerfile
├── Jenkinsfile
├── README.md
├── VERSION
└── server.py

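If the Jenkinsfile doesn't need to be managed in the development team repository, it may be a good idea to reference a different repository in the Pipeline script from SCM setting when creating the Jenkins Item.

3. Setting Up CD with Argo CD

In the CI test above, we deployed the Deployment to Kubernetes manually. This step could also be handled as CD by Jenkins.

For example, you could add stages such as k8s deployment blue version, as shown below.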

pipeline {
    agent any

    environment {
        KUBECONFIG = credentials('k8s-crd')
    }

    stages {
        stage('Checkout') {
            steps {
                 git branch: 'main',
                 url: 'http://<your host IP>:3000/devops/dev-app.git',  // check out the code from Git
                 credentialsId: 'gogs-crd'  // Credentials ID
            }
        }

        stage('container image build') {
            steps {
                echo "container image build" // omitted
            }
        }

        stage('container image upload') {
            steps {
                echo "container image upload" // omitted
            }
        }

        stage('k8s deployment blue version') {
            steps {
                sh "kubectl apply -f ./deploy/echo-server-blue.yaml --kubeconfig $KUBECONFIG"
                sh "kubectl apply -f ./deploy/echo-server-service.yaml --kubeconfig $KUBECONFIG"
            }
        }

        stage('approve green version') {
            steps {
                input message: 'approve green version', ok: "Yes"
            }
        }

        stage('k8s deployment green version') {
            steps {
                sh "kubectl apply -f ./deploy/echo-server-green.yaml --kubeconfig $KUBECONFIG"
            }
        }

        stage('approve version switching') {
            steps {
                script {
                    returnValue = input message: 'Green switching?', ok: "Yes", parameters: [booleanParam(defaultValue: true, name: 'IS_SWITCHED')]
                    if (returnValue) {
                        sh "kubectl patch svc echo-server-service -p '{\"spec\": {\"selector\": {\"version\": \"green\"}}}' --kubeconfig $KUBECONFIG"
                    }
                }
            }
        }

        stage('Blue Rollback') {
            steps {
                script {
                    returnValue = input message: 'Blue Rollback?', parameters: [choice(choices: ['done', 'rollback'], name: 'IS_ROLLBACK')]
                    if (returnValue == "done") {
                        sh "kubectl delete -f ./deploy/echo-server-blue.yaml --kubeconfig $KUBECONFIG"
                    }
                    if (returnValue == "rollback") {
                        sh "kubectl patch svc echo-server-service -p '{\"spec\": {\"selector\": {\"version\": \"blue\"}}}' --kubeconfig $KUBECONFIG"
                    }
                }
            }
        }
    }
}

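However, this approach has drawbacks: Jenkins needs binaries such as kubectl, or you have to use additional plugins.

Also, because the yaml files for each resource end up in the development team repository, responsibility for managing files within a single repository is not clearly separated.

From another angle, you may want to rule out situations where users change cluster objects at will.

For example, if deployment has finished and the service is in operation, but a user modifies an object such as a Deployment, the cluster is now in a different state from the one at deployment time.

In other words, from a GitOps perspective, you need an approach that defines the declarative state in a Git repository (Desired Manifest) and continuously keeps the running state (Live Manifest) in line with it.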
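The Desired versus Live comparison can be pictured as a recursive diff over two manifests. The toy sketch below only checks the fields that the desired manifest declares, so defaults added by the cluster do not count as drift. The function name and the sample dictionaries are hypothetical, and Argo CD's actual diffing is considerably more involved than this.

```python
def find_drift(desired: dict, live: dict, path: str = "") -> list[tuple]:
    """Return (field path, desired value, live value) for every declared field that differs."""
    drift = []
    for key, want in desired.items():
        here = f"{path}.{key}" if path else key
        have = live.get(key)
        if isinstance(want, dict) and isinstance(have, dict):
            drift.extend(find_drift(want, have, here))  # descend into nested maps
        elif want != have:
            drift.append((here, want, have))
    return drift


# Desired state as declared in Git, and a live state someone scaled by hand.
desired = {"spec": {"replicas": 2, "template": {"image": "dev-app:0.0.1"}}}
live = {"spec": {"replicas": 5, "template": {"image": "dev-app:0.0.1"}, "paused": False}}

print(find_drift(desired, live))
```

A GitOps controller that finds such a difference can either report the application as out of sync or apply the desired value again, depending on its sync policy.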

Argo CD lets you implement this approach.

First, add a webhook under Settings > Webhooks in the Gogs ops-deploy repository.

Payload URL: http://:30002/api/webhook

Then write the files needed for the lab and push them to the DevOps team repository in Gogs.

cd cicd-labs

TOKEN=<>
MyIP=172.28.157.42
git clone http://devops:$TOKEN@$MyIP:3000/devops/ops-deploy.git
cd ops-deploy

# Basic git configuration
git --no-pager config --local --list
git config --local user.name "devops"
git config --local user.email "a@a.com"
git config --local init.defaultBranch main
git config --local credential.helper store
git --no-pager config --local --list
cat .git/config

# Check git
git --no-pager branch
* main

git remote -v
origin  http://devops:8cdf5569aedd230503abea67b0794b4d1e931c10@172.28.157.42:3000/devops/ops-deploy.git (fetch)
origin  http://devops:8cdf5569aedd230503abea67b0794b4d1e931c10@172.28.157.42:3000/devops/ops-deploy.git (push)


# Create the folder
mkdir dev-app

# Docker account information
DHUSER=<your Docker Hub account>

# Version information
VERSION=0.0.1

# Create the VERSION and yaml files
cat > dev-app/VERSION <<EOF
$VERSION
EOF

cat > dev-app/timeserver.yaml <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: timeserver
spec:
  replicas: 2
  selector:
    matchLabels:
      pod: timeserver-pod
  template:
    metadata:
      labels:
        pod: timeserver-pod
    spec:
      containers:
      - name: timeserver-container
        image: docker.io/$DHUSER/dev-app:$VERSION
        livenessProbe:
          initialDelaySeconds: 30
          periodSeconds: 30
          httpGet:
            path: /healthz
            port: 80
            scheme: HTTP
          timeoutSeconds: 5
          failureThreshold: 3
          successThreshold: 1
      imagePullSecrets:
      - name: dockerhub-secret
EOF

cat > dev-app/service.yaml <<EOF
apiVersion: v1
kind: Service
metadata:
  name: timeserver
spec:
  selector:
    pod: timeserver-pod
  ports:
  - port: 80
    targetPort: 80
    protocol: TCP
    nodePort: 30000
  type: NodePort
EOF

# Git Push
git add . && git commit -m "Add dev-app deployment yaml" && git push -u origin main

Argo CD uses a CRD called Application to define the declarative configuration to deploy to a Kubernetes cluster and how to keep it in sync.

Create an Application that points at ops-deploy as follows.

# Create the Application
cat <<EOF | kubectl apply -f -
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: timeserver
  namespace: argocd
  finalizers:
  - resources-finalizer.argocd.argoproj.io
spec:
  project: default
  source:
    path: dev-app
    repoURL: http://$MyIP:3000/devops/ops-deploy
    targetRevision: HEAD
  syncPolicy:
    automated:
      prune: true
    syncOptions:
    - CreateNamespace=true
  destination:
    namespace: default
    server: https://kubernetes.default.svc
EOF

As soon as the Application is created, you can see Argo CD sync based on the yaml in ops-deploy.

You can check that it ran successfully with the commands below.

# Verify
kubectl get applications -n argocd timeserver
NAME         SYNC STATUS   HEALTH STATUS
timeserver   Synced        Healthy

# Test the service
curl http://127.0.0.1:30000
The time is 6:40:06 PM, VERSION 0.0.1
Server hostname: timeserver-565559b4bf-sd76g
curl http://127.0.0.1:30000/healthz
Healthy

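Going One Step Further

Taking this further, let's finish by making a change in the development team repository propagate to the DevOps team repository, which in turn triggers a Sync to the Kubernetes cluster.

Modify the Jenkinsfile in the development team repository (dev-app) as follows.

Compared with the previous Jenkins pipeline, the ops-deploy Checkout stage checks out ops-deploy, and the ops-deploy version update push stage applies the changed VERSION and pushes it to ops-deploy.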

pipeline {
    agent any
    environment {
        DOCKER_IMAGE = '<your Docker Hub account>/dev-app' // Docker image name
        GOGSCRD = credentials('gogs-crd')
    }
    stages {
        stage('dev-app Checkout') {
            steps {
                 git branch: 'main',
                 url: 'http://<your host IP>:3000/devops/dev-app.git',  // check out the code from Git
                 credentialsId: 'gogs-crd'  // Credentials ID
            }
        }
        stage('Read VERSION') {
            steps {
                script {
                    // Read the VERSION file
                    def version = readFile('VERSION').trim()
                    echo "Version found: ${version}"
                    // Set the environment variable
                    env.DOCKER_TAG = version
                }
            }
        }
        stage('Docker Build and Push') {
            steps {
                script {
                    docker.withRegistry('https://index.docker.io/v1/', 'dockerhub-crd') {
                        // Use DOCKER_TAG
                        def appImage = docker.build("${DOCKER_IMAGE}:${DOCKER_TAG}")
                        appImage.push()
                        appImage.push("latest")
                    }
                }
            }
        }
        stage('ops-deploy Checkout') {
            steps {
                 git branch: 'main',
                 url: 'http://<your host IP>:3000/devops/ops-deploy.git',  // check out the code from Git
                 credentialsId: 'gogs-crd'  // Credentials ID
            }
        }
        stage('ops-deploy version update push') {
            steps {
                sh '''
                OLDVER=$(cat dev-app/VERSION)
                NEWVER=$(echo ${DOCKER_TAG})
                sed -i "s/$OLDVER/$NEWVER/" dev-app/timeserver.yaml
                sed -i "s/$OLDVER/$NEWVER/" dev-app/VERSION
                git add ./dev-app
                git config user.name "devops"
                git config user.email "a@a.com"
                git commit -m "version update ${DOCKER_TAG}"
                git push http://${GOGSCRD_USR}:${GOGSCRD_PSW}@<your host IP>:3000/devops/ops-deploy.git
                '''
            }
        }
    }
    post {
        success {
            echo "Docker image ${DOCKER_IMAGE}:${DOCKER_TAG} has been built and pushed successfully!"
        }
        failure {
            echo "Pipeline failed. Please check the logs."
        }
    }
}

Pushing code to the development team repository triggers the Jenkins Job; the Jenkins Pipeline builds the container image and then writes the version change to ops-deploy; finally, Argo CD syncs the change to the Kubernetes environment.
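One detail of the ops-deploy version update push stage is worth a closer look: in `sed "s/$OLDVER/$NEWVER/"` the dots in a version such as 0.0.1 act as regex wildcards, and the first match on every line is replaced, not only the image tag. A stricter alternative is to rewrite only the tag of the known image. The sketch below shows that idea in Python; it is not the pipeline's method, and the function name and the sample account `example` are hypothetical.

```python
import re


def bump_image_tag(manifest_text: str, image: str, new_tag: str) -> str:
    """Replace the tag of `image` in 'image:' lines and leave everything else untouched."""
    # re.escape keeps characters such as '.' in the image name literal.
    pattern = re.compile(rf"(image:\s*\S*{re.escape(image)}):\S+")
    updated, count = pattern.subn(lambda m: f"{m.group(1)}:{new_tag}", manifest_text)
    if count == 0:
        raise ValueError(f"no image reference for {image!r} found")
    return updated


MANIFEST = """\
      containers:
      - name: timeserver-container
        image: docker.io/example/dev-app:0.0.1
        livenessProbe:
          initialDelaySeconds: 30
"""

print(bump_image_tag(MANIFEST, "example/dev-app", "0.0.3"))
```

Raising an error when nothing matches also avoids committing an unchanged manifest by accident.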

Now update the VERSION information as shown below and push.

# Edit the VERSION file: 0.0.3
# Edit the server.py file: 0.0.3

# git push: VERSION, server.py, Jenkinsfile
git add . && git commit -m "VERSION $(cat VERSION) Changed" && git push -u origin main

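A few errors came up during testing, but in the end the change was applied successfully.

In Argo CD, you can also see a new ReplicaSet being created through the Sync.

That concludes the lab.

Wrap-up

In this lab we set up CI with Jenkins and CD with Argo CD.

I hope it gave you a rough idea of how the CI/CD process can be kept simple and automated in a Kubernetes environment.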


[7] EKS Fargate

한명 2025. 3. 23. 02:39


In this post, we'll look at Fargate on EKS.

EKS Fargate runs containers on a serverless compute engine instead of using EKS node groups.

We'll first look at EKS Fargate, then at the similar Virtual Nodes feature in AKS to see how each managed Kubernetes service implements running containers without nodes, and confirm it through hands-on labs.

1. EKS Fargate

Normally in EKS you create node groups to get worker nodes. Among the compute options EKS offers, EKS Fargate is the one that does not use EC2 instances as nodes.

To understand AWS Fargate, let's first look at Amazon ECS.

As one way to run containers on AWS, Amazon ECS (Elastic Container Service) is offered as a fully managed container orchestration service. With Amazon ECS, users can easily deploy and manage containerized applications.

Amazon ECS has three layers, as shown below, and AWS Fargate appears among the Capacity options, which represent the infrastructure ECS runs on.

Source: https://docs.aws.amazon.com/ko_kr/AmazonECS/latest/developerguide/Welcome.html

If you choose EC2 as the ECS capacity option, containers run on actual EC2 instances. Fargate, by contrast, is a serverless, pay-as-you-go compute engine. Because you don't deploy virtual machines yourself, it has the advantage of being lightweight.

EKS Fargate works the same way. EKS provides worker nodes on EC2 through node groups, but you can use Fargate, a serverless compute engine, instead.

As shown below, these are separate options for the Data Plane where EKS pods run.

Source: https://www.eksworkshop.com/docs/fundamentals/fargate/

Compute options like Fargate generally suit stateless applications that do not need to run continuously. For workloads that run a specific job and exit, or that need fast deployment and can be shut down when no longer needed, Fargate and its serverless compute engine are worth considering.

Let's explore EKS Fargate further through a lab.

The lab follows an example from Amazon EKS Blueprints for Terraform.

Reference: https://aws-ia.github.io/terraform-aws-eks-blueprints/

# Fetch the Terraform code
git clone https://github.com/aws-ia/terraform-aws-eks-blueprints
cd terraform-aws-eks-blueprints/patterns/fargate-serverless

# Initialize Terraform
terraform init

# Review the Terraform plan
terraform plan

# Deploy with Terraform
# Deploys EKS, add-ons, and the Fargate profiles (takes about 13 minutes)
terraform apply -auto-approve


# Verify after the deployment completes
terraform state list
module.eks.data.aws_caller_identity.current
...

terraform output
...

Looking at the created resources, there are four Fargate nodes, and a pod is running on each of them.

Note that each pod's IP is the same as its node's IP, which shows that EKS Fargate runs one Fargate node per pod.

# Get the kubeconfig
aws eks --region us-west-2 update-kubeconfig --name fargate-serverless

# Check node and pod information
kubectl get no -o wide
NAME                                                STATUS   ROLES    AGE   VERSION               INTERNAL-IP   EXTERNAL-IP   OS-IMAGE         KERNEL-VERSION                  CONTAINER-RUNTIME
fargate-ip-10-0-1-239.us-west-2.compute.internal    Ready    <none>   48m   v1.30.8-eks-2d5f260   10.0.1.239    <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25
fargate-ip-10-0-18-94.us-west-2.compute.internal    Ready    <none>   48m   v1.30.8-eks-2d5f260   10.0.18.94    <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25
fargate-ip-10-0-20-74.us-west-2.compute.internal    Ready    <none>   48m   v1.30.8-eks-2d5f260   10.0.20.74    <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25
fargate-ip-10-0-35-232.us-west-2.compute.internal   Ready    <none>   48m   v1.30.8-eks-2d5f260   10.0.35.232   <none>        Amazon Linux 2   5.10.234-225.910.amzn2.x86_64   containerd://1.7.25

kubectl get pod -A -o wide
NAMESPACE     NAME                                           READY   STATUS    RESTARTS   AGE   IP            NODE                                                NOMINATED NODE   READINESS GATES
kube-system   aws-load-balancer-controller-c946d85dd-2n65t   1/1     Running   0          48m   10.0.35.232   fargate-ip-10-0-35-232.us-west-2.compute.internal   <none>           <none>
kube-system   aws-load-balancer-controller-c946d85dd-2t662   1/1     Running   0          48m   10.0.18.94    fargate-ip-10-0-18-94.us-west-2.compute.internal    <none>           <none>
kube-system   coredns-69fd949db7-95njt                       1/1     Running   0          49m   10.0.20.74    fargate-ip-10-0-20-74.us-west-2.compute.internal    <none>           <none>
kube-system   coredns-69fd949db7-b5jpf                       1/1     Running   0          49m   10.0.1.239    fargate-ip-10-0-1-239.us-west-2.compute.internal    <none>           <none>

Looking at the node details, a label and a taint for compute-type have been applied.

kubectl describe node | grep -A 3 Labels
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    eks.amazonaws.com/compute-type=fargate
...
kubectl describe node | grep Taints
Taints:             eks.amazonaws.com/compute-type=fargate:NoSchedule
...

To use Fargate on EKS, you must create a Fargate profile. The profile specifies in advance, through selectors, the namespaces and labels of the resources that will run on Fargate. It also contains the subnets that pods are deployed to and the IAM role.

Source: https://aws.amazon.com/ko/blogs/containers/use-cloudformation-to-automate-management-of-the-fargate-profile-in-amazon-eks/

The Terraform code for the exercise defines the Fargate profiles as follows.


  fargate_profiles = {
    app_wildcard = {
      selectors = [
        { namespace = "app-*" }
      ]
    }
    kube_system = {
      name = "kube-system"
      selectors = [
        { namespace = "kube-system" }
      ]
    }
  }

In the web console they appear as follows.

Because of the kube-system profile, the pods in kube-system also run entirely on Fargate rather than on a managed node group.

When a pod that matches a Fargate profile is submitted to the API server, a mutating webhook (an admission controller) modifies it so that it is scheduled onto Fargate.

In more detail, when the pod is submitted, the mutating webhook adds the Fargate profile information and sets schedulerName to fargate-scheduler, as shown below. Based on this information, the Fargate scheduler schedules the pod onto the Fargate environment, where it runs.

Source: https://aws.amazon.com/ko/blogs/containers/the-role-of-aws-fargate-in-the-container-world/

Looking at a coredns pod, you can see that both the fargate-profile and the schedulerName are set.

kubectl get po -n kube-system   coredns-69fd949db7-95njt -oyaml |grep fargate
    eks.amazonaws.com/fargate-profile: kube-system
  nodeName: fargate-ip-10-0-20-74.us-west-2.compute.internal
  schedulerName: fargate-scheduler

Because scheduling is based on whether the submitted pod matches the selectors in a Fargate profile, resources scheduled to Fargate are not deployed onto regular nodes.

Workloads on regular nodes and workloads on Fargate are mutually exclusive in terms of scheduling. For example, even when the nodes run out of capacity, pods cannot burst onto Fargate.
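
To make the selector matching concrete, here is a minimal Python sketch of how a pod's namespace and labels could be compared against the two profiles from the Terraform snippet above. The function name `matching_profile` and the `PROFILES` dictionary are hypothetical names for this illustration, and the wildcard handling through `fnmatchcase` only approximates whatever matching EKS performs internally.

```python
from fnmatch import fnmatchcase

# Hypothetical in-memory copy of the two profiles from the Terraform snippet.
PROFILES = {
    "app_wildcard": [{"namespace": "app-*"}],
    "kube-system": [{"namespace": "kube-system"}],
}


def matching_profile(namespace, pod_labels, profiles=PROFILES):
    """Return the first profile whose selector matches the pod, or None."""
    for name, selectors in profiles.items():
        for selector in selectors:
            # The namespace must match, with "*" treated as a wildcard.
            if not fnmatchcase(namespace, selector["namespace"]):
                continue
            # Every label listed in the selector must be present on the pod.
            wanted = selector.get("labels", {})
            if all(pod_labels.get(k) == v for k, v in wanted.items()):
                return name
    # No profile matched: the pod is left to the default scheduler and regular nodes.
    return None


print(matching_profile("kube-system", {"k8s-app": "kube-dns"}))
print(matching_profile("app-web", {}))
print(matching_profile("default", {}))
```

A pod in the default namespace matches no profile here, which mirrors the exclusive behavior described above: such a pod stays on regular nodes and is never moved to Fargate.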

Fargate itself is not a node resource created by the user, so no instances appear in the EC2 Instances list.

The network interfaces, however, are visible. As shown below, the owner of the network interface differs from the owner of the instance.

EKS Fargate connects to the user's VPC as shown below.

Source: https://www.kiranjthomas.com/posts/fargate-under-the-hood/

1) The EC2 instance for Fargate runs in a separate Fargate VPC.

2) The instance's primary network interface sits in the Fargate VPC and handles network traffic for the container runtime, the Fargate agent, and the guest kernel and OS.

3) The instance's secondary network interface is attached to the user's VPC and handles traffic such as container-to-container communication and image pulls.

The figure and description above show Fargate as EC2, but it actually uses Firecracker, which is known as a lightweight VM.

It is tempting to think EKS Fargate is cost-effective because you do not have to keep EC2 instances running, but Fargate is generally priced higher than EC2 of the same capacity. Part of the reason is that the kube-proxy, containerd, and kubelet components deployed for each pod consume some additional resources.

As the slide below shows, about 256 MB is added.

Source: https://www.youtube.com/watch?v=N0uLK5syctU

In addition, these resources are rounded up to fit a Fargate configuration, so be aware that the Fargate capacity actually used can be larger than the requests in the pod spec.

For that reason, the Fargate option on EKS should be judged on whether the workload suits serverless rather than on cost.
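
The sizing behavior described above (add roughly 256 MB for the Kubernetes components, then round up to an available Fargate configuration) can be sketched in Python as follows. The `FARGATE_SIZES` table is a partial list written from the AWS documentation as an assumption for illustration, and `fargate_size` is a hypothetical helper; check the current AWS documentation for the authoritative combinations.

```python
# (vCPU, allowed memory sizes in MiB). Partial table, assumed for illustration only.
FARGATE_SIZES = [
    (0.25, [512, 1024, 2048]),
    (0.5, [1024 * g for g in range(1, 5)]),
    (1.0, [1024 * g for g in range(2, 9)]),
    (2.0, [1024 * g for g in range(4, 17)]),
    (4.0, [1024 * g for g in range(8, 31)]),
]

# Memory reserved for kubelet, kube-proxy, and containerd (value taken from the slide above).
OVERHEAD_MIB = 256


def fargate_size(cpu_request, memory_request_mib):
    """Return the smallest (vCPU, MiB) configuration that fits the pod's requests."""
    needed_mem = memory_request_mib + OVERHEAD_MIB
    for vcpu, memory_options in FARGATE_SIZES:
        if vcpu < cpu_request:
            continue  # not enough CPU in this tier
        for mem in memory_options:
            if mem >= needed_mem:
                return vcpu, mem
    raise ValueError("request exceeds the sizes listed in this sketch")


# A pod requesting 200m CPU and 512 MiB ends up on a larger configuration.
print(fargate_size(0.2, 512))
print(fargate_size(0.25, 2048))
```

Under these assumptions, a pod that requests exactly 2048 MiB no longer fits the 0.25 vCPU tier once the overhead is added, so it moves up to the next tier. This is the kind of gap between the pod's requests and the capacity actually used that the paragraph above warns about.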

EKS Fargate also comes with a number of considerations, so check the limitations in the documentation beforehand.

https://docs.aws.amazon.com/ko_kr/eks/latest/userguide/fargate.html#fargate-considerations

Clean up the resources as follows to finish the exercise.

terraform destroy -auto-approve

2. AKS Virtual Nodes

On AKS you can run pods on Virtual Nodes instead of regular nodes.

Azure has a serverless container service called ACI (Azure Container Instances), which is comparable to Amazon ECS on AWS. When pods run through Virtual Nodes on AKS, they can be thought of as running in the form of ACI.

Reference: https://learn.microsoft.com/ko-kr/azure/container-instances/container-instances-overview

When Virtual Nodes are enabled on AKS, listing the nodes shows an additional virtual node. This is built on the open-source Virtual Kubelet project.

Virtual Kubelet behaves like a kubelet while connecting Kubernetes to other APIs. This lets services such as ACI and AWS Fargate be used as if they were nodes.

The figure below shows how Virtual Kubelet works: like a kubelet, it registers itself as a node and implements the API needed for pods to be scheduled onto the virtual node.

Source: https://github.com/virtual-kubelet/virtual-kubelet?tab=readme-ov-file

With Virtual Nodes on AKS, pods are scheduled onto the virtual node, and the virtual kubelet works with ACI to run them.

AKS supports Virtual Nodes as an add-on.

Let's explore AKS Virtual Nodes by following the tutorial below.

https://docs.azure.cn/en-us/aks/virtual-nodes-cli

# Declare variables
PREFIX=aks-vn
RG=${PREFIX}-rg
AKSNAME=${PREFIX}
LOC=koreacentral
VNET=aks-vnet
AKSSUBNET=aks-subnet
VNSUBNET=vn-subnet

# Create the resource group
az group create --name $RG --location $LOC -o none

az network vnet create --resource-group $RG --name $VNET --address-prefixes 10.0.0.0/8 --subnet-name $AKSSUBNET --subnet-prefix 10.240.0.0/16 -o none
az network vnet subnet create --resource-group $RG --vnet-name $VNET --name $VNSUBNET --address-prefixes 10.241.0.0/16 -o none

SUBNET_ID=$(az network vnet subnet show --resource-group $RG --vnet-name $VNET --name $AKSSUBNET --query id -o tsv)

# Create the AKS cluster
az aks create --resource-group $RG --name $AKSNAME --node-count 2 --network-plugin azure --vnet-subnet-id $SUBNET_ID --generate-ssh-keys

# Check the nodes
az aks get-credentials --resource-group $RG --name $AKSNAME
kubectl get nodes
NAME                                STATUS   ROLES    AGE    VERSION
aks-nodepool1-14565790-vmss000000   Ready    <none>   100s   v1.30.9
aks-nodepool1-14565790-vmss000001   Ready    <none>   100s   v1.30.9

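After creating the AKS cluster, the two default nodes are visible. Whereas on EKS even the add-on components can run on Fargate, on AKS the core system components still need to run on regular nodes.

Now enable the Virtual Nodes add-on and list the nodes again; a node corresponding to the virtual node appears.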

# Enable the Virtual Nodes add-on
az aks enable-addons --resource-group $RG --name $AKSNAME --addons virtual-node --subnet-name $VNSUBNET

# Check the nodes
kubectl get nodes
NAME                                STATUS   ROLES    AGE     VERSION
aks-nodepool1-14565790-vmss000000   Ready    <none>   14m     v1.30.9
aks-nodepool1-14565790-vmss000001   Ready    <none>   14m     v1.30.9
virtual-node-aci-linux              Ready    agent    2m51s   v1.25.0-vk-azure-aci-1.6.2

Among the running pods there is one named aci-connector-linux. It acts as the virtual kubelet and bridges the AKS cluster and the ACI management API.

The commands below show that aci-connector-linux and the virtual node share the same IP, 10.240.0.32.

kubectl get po -A -owide
NAMESPACE     NAME                                   READY   STATUS    RESTARTS   AGE     IP            NODE                                NOMINATED NODE   READINESS GATES
kube-system   aci-connector-linux-79d9bf8946-7hv8s   1/1     Running   0          17m     10.240.0.32   aks-nodepool1-14565790-vmss000001   <none>           <none>
..

kubectl get no -A -owide
NAME                                STATUS   ROLES    AGE   VERSION                      INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
..
virtual-node-aci-linux              Ready    agent    15m   v1.25.0-vk-azure-aci-1.6.2   10.240.0.32   <none>        <unknown>            <unknown>           <unknown>

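In the portal you can also see that, unlike the subnet for the AKS nodes, the subnet created for the Virtual Node is delegated to Azure Container Instances, because the actual deployment there is carried out by ACI.

On EKS, you create a Fargate profile, and if a pod matches the profile it is deployed to Fargate by the Fargate scheduler.

Virtual Nodes, by contrast, carry the taint shown below by default, and ordinary taints and tolerations decide whether a pod goes to a regular node or to a Virtual Node. This is no different from standard scheduling techniques.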

$ kubectl describe no virtual-node-aci-linux |grep -A 1 -B 1 Taint
CreationTimestamp:  Sat, 22 Mar 2025 15:57:33 +0000
Taints:             virtual-kubelet.io/provider=azure:NoSchedule
Unschedulable:      false

Workloads that run on Virtual Nodes therefore need a toleration. It also follows that unless scheduling is forced onto the Virtual Node, the pods can run on regular nodes as well.
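
As a rough illustration of why the toleration is required, the following Python sketch applies a simplified version of the Kubernetes taint and toleration rules to the Virtual Node taint shown above. The helper names `tolerates` and `schedulable` are hypothetical, and the logic deliberately omits details of the real scheduler such as `tolerationSeconds` and `PreferNoSchedule`.

```python
def tolerates(toleration, taint):
    """Return True if one toleration matches one taint (simplified rules)."""
    # A toleration without an effect matches every taint effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key")
    if operator == "Exists":
        # Exists without a key tolerates everything; otherwise the keys must match.
        return key is None or key == taint["key"]
    # Equal: both key and value must match.
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


def schedulable(pod_tolerations, node_taints):
    """A node is usable only if every blocking taint is tolerated by the pod."""
    blocking = [t for t in node_taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in pod_tolerations) for taint in blocking)


virtual_node_taints = [
    {"key": "virtual-kubelet.io/provider", "value": "azure", "effect": "NoSchedule"}
]
regular_node_taints = []

pod_tolerations = [
    {"key": "virtual-kubelet.io/provider", "operator": "Exists"},
    {"key": "azure.com/aci", "effect": "NoSchedule"},
]

print(schedulable(pod_tolerations, virtual_node_taints))  # pod with the toleration
print(schedulable([], virtual_node_taints))               # pod without any toleration
print(schedulable(pod_tolerations, regular_node_taints))  # same pod on a regular node
```

The last call shows the point made above: a toleration only permits placement on the Virtual Node. It does not exclude regular nodes, so the same pod remains schedulable on both.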

Let's deploy the sample application below and see where the pods actually land.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: aci-helloworld
spec:
  replicas: 4
  selector:
    matchLabels:
      app: aci-helloworld
  template:
    metadata:
      labels:
        app: aci-helloworld
    spec:
      containers:
      - name: aci-helloworld
        image: mcr.microsoft.com/azuredocs/aci-helloworld
        ports:
        - containerPort: 80
        resources:
          requests:
            cpu: 200m
      tolerations:
      - key: virtual-kubelet.io/provider
        operator: Exists
      - key: azure.com/aci
        effect: NoSchedule
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 1
              preference:
                matchExpressions:
                  - key: type
                    operator: NotIn
                    values:
                      - virtual-kubelet

The pod's tolerations and affinity are worth a closer look. First, a toleration for the Virtual Node taint is specified.

      tolerations:
      - key: virtual-kubelet.io/provider
        operator: Exists
      - key: azure.com/aci
        effect: NoSchedule

With only this toleration, pods could be placed on the Virtual Node right away, so a nodeAffinity was deliberately added as shown below.

      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 1
              preference:
                matchExpressions:
                  - key: type
                    operator: NotIn
                    values:
                      - virtual-kubelet

Deployed this way, pods are scheduled first onto nodes that are not virtual-kubelet, as the nodeAffinity prefers, and the remaining pods that could not be placed there end up on the virtual node.

kubectl get po -owide
NAME                              READY   STATUS    RESTARTS   AGE   IP            NODE                                NOMINATED NODE   READINESS GATES
aci-helloworld-86c987d849-9pw8r   1/1     Running   0          52s   10.240.0.55   aks-nodepool1-14565790-vmss000000   <none>           <none>
aci-helloworld-86c987d849-hp5nv   1/1     Running   0          53s   10.240.0.8    aks-nodepool1-14565790-vmss000001   <none>           <none>
aci-helloworld-86c987d849-rh9tx   1/1     Running   0          52s   10.241.0.4    virtual-node-aci-linux              <none>           <none>
aci-helloworld-86c987d849-v8kdx   1/1     Running   0          52s   10.240.0.18   aks-nodepool1-14565790-vmss000001   <none>           <none>

In other words, because the pod has the toleration it can also be deployed to the virtual node, so pods that would otherwise be unschedulable go to the virtual node. This provides scalability through Virtual Nodes even without Cluster Autoscaler.

The aci-connector-linux pod logs show ACI creating the container group. At the end, you can see the log entry, relayed from ACI, reporting that the container was Started.

time="2025-03-22T16:05:41Z" level=info msg="creating container group with name: default-aci-helloworld-6d49f9cfbc-h76bc" addedViaRedirty=false azure.region=koreacentral azure.resourceGroup=MC_aks-vn-rg_aks-vn_koreacentral delayedViaRateLimit=5ms key=default/aci-helloworld-6d49f9cfbc-h76bc method=CreateContainerGroup name=aci-helloworld-6d49f9cfbc-h76bc namespace=default originallyAdded="2025-03-22 16:05:41.362846244 +0000 UTC m=+488.605163852" phase=Pending plannedForWork="2025-03-22 16:05:41.367846244 +0000 UTC m=+488.610163852" pod=aci-helloworld-6d49f9cfbc-h76bc queue=syncPodsFromKubernetes reason= requeues=0 uid=d6a836b6-6b7d-4b57-90db-a5c109d17d6a workerId=49
...
time="2025-03-22T16:05:43Z" level=warning msg="cannot fetch aci events for pod aci-helloworld-6d49f9cfbc-h76bc in namespace default" error="cg is not found" method=PodsTracker.processPodUpdates
time="2025-03-22T16:05:43Z" level=info msg="Created pod in provider" addedViaRedirty=false delayedViaRateLimit=5ms key=default/aci-helloworld-6d49f9cfbc-h76bc method=createOrUpdatePod name=aci-helloworld-6d49f9cfbc-h76bc namespace=default originallyAdded="2025-03-22 16:05:41.362846244 +0000 UTC m=+488.605163852" phase=Pending plannedForWork="2025-03-22 16:05:41.367846244 +0000 UTC m=+488.610163852" pod=aci-helloworld-6d49f9cfbc-h76bc queue=syncPodsFromKubernetes reason= requeues=0 uid=d6a836b6-6b7d-4b57-90db-a5c109d17d6a workerId=49
time="2025-03-22T16:05:43Z" level=info msg="Event(v1.ObjectReference{Kind:\"Pod\", Namespace:\"default\", Name:\"aci-helloworld-6d49f9cfbc-h76bc\", UID:\"d6a836b6-6b7d-4b57-90db-a5c109d17d6a\", APIVersion:\"v1\", ResourceVersion:\"5821\", FieldPath:\"\"}): type: 'Normal' reason: 'ProviderCreateSuccess' Create pod in provider successfully"
E0322 16:05:43.818182       1 event.go:346] "Server rejected event (will not retry!)" err="events is forbidden: User \"system:serviceaccount:kube-system:aci-connector-linux\" cannot create resource \"events\" in API group \"\" in the namespace \"default\"" event="&Event{ObjectMeta:{aci-helloworld-6d49f9cfbc-h76bc.182f2b9f418838d9  default    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},InvolvedObject:ObjectReference{Kind:Pod,Namespace:default,Name:aci-helloworld-6d49f9cfbc-h76bc,UID:d6a836b6-6b7d-4b57-90db-a5c109d17d6a,APIVersion:v1,ResourceVersion:5821,FieldPath:,},Reason:ProviderCreateSuccess,Message:Create pod in provider successfully,Source:EventSource{Component:virtual-node-aci-linux/pod-controller,Host:,},FirstTimestamp:2025-03-22 16:05:43.814912217 +0000 UTC m=+491.057229925,LastTimestamp:2025-03-22 16:05:43.814912217 +0000 UTC m=+491.057229925,Count:1,Type:Normal,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:virtual-node-aci-linux/pod-controller,ReportingInstance:,}"
...
time="2025-03-22T16:06:50Z" level=error msg="failed to retrieve pod aci-helloworld-6d49f9cfbc-h76bc status from provider" error="container aci-helloworld properties CurrentState StartTime cannot be nil" method=PodsTracker.processPodUpdates
time="2025-03-22T16:06:55Z" level=info msg="Event(v1.ObjectReference{Kind:\"Pod\", Namespace:\"default\", Name:\"aci-helloworld-6d49f9cfbc-h76bc\", UID:\"d6a836b6-6b7d-4b57-90db-a5c109d17d6a\", APIVersion:\"v1\", ResourceVersion:\"5821\", FieldPath:\"spec.containers{aci-helloworld}\"}): type: 'Normal' reason: 'Pulling' pulling image \"mcr.microsoft.com/azuredocs/aci-helloworld@sha256:b9cec4d6b50c6bf25e3f7f93bdc1628e5dca972cf132d38ed8f5bc955bb179c3\""
time="2025-03-22T16:06:55Z" level=info msg="Event(v1.ObjectReference{Kind:\"Pod\", Namespace:\"default\", Name:\"aci-helloworld-6d49f9cfbc-h76bc\", UID:\"d6a836b6-6b7d-4b57-90db-a5c109d17d6a\", APIVersion:\"v1\", ResourceVersion:\"5821\", FieldPath:\"spec.containers{aci-helloworld}\"}): type: 'Normal' reason: 'Pulled' Successfully pulled image \"mcr.microsoft.com/azuredocs/aci-helloworld@sha256:b9cec4d6b50c6bf25e3f7f93bdc1628e5dca972cf132d38ed8f5bc955bb179c3\""
time="2025-03-22T16:06:55Z" level=info msg="Event(v1.ObjectReference{Kind:\"Pod\", Namespace:\"default\", Name:\"aci-helloworld-6d49f9cfbc-h76bc\", UID:\"d6a836b6-6b7d-4b57-90db-a5c109d17d6a\", APIVersion:\"v1\", ResourceVersion:\"5821\", FieldPath:\"spec.containers{aci-helloworld}\"}): type: 'Normal' reason: 'Started' Started container"
...

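As seen earlier, workloads on EKS Fargate are scheduled exclusively and never land on regular nodes, whereas running on AKS Virtual Nodes complements the regular nodes. That is, pods are deployed to regular nodes first, and when more resources are needed, Virtual Nodes can be used even without Cluster Autoscaler.

Of course, if you have specific requirements, you can use scheduling techniques such as nodeSelector, shown below, to deploy only to Virtual Nodes. Alternatively, if you use only the toleration, pods are also scheduled onto Virtual Nodes first.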

...
      nodeSelector:
        kubernetes.io/role: agent
        beta.kubernetes.io/os: linux
        type: virtual-kubelet
...

On EKS, Fargate pods and nodes map 1:1, whereas on AKS there is still only one Virtual Node no matter how many pods run on it.

Scale the deployment to 6 replicas as shown below so that several pods land on the Virtual Node. Checking the nodes shows that there is still only one virtual node.

kubectl scale deployment aci-helloworld --replicas 6
deployment.apps/aci-helloworld scaled

kubectl get po -owide
NAME                              READY   STATUS    RESTARTS   AGE     IP            NODE                                NOMINATED NODE   READINESS GATES
aci-helloworld-86c987d849-9dbdx   1/1     Running   0          79s     10.240.0.42   aks-nodepool1-14565790-vmss000000   <none>           <none>
aci-helloworld-86c987d849-9pw8r   1/1     Running   0          7m49s   10.240.0.55   aks-nodepool1-14565790-vmss000000   <none>           <none>
aci-helloworld-86c987d849-fchwx   1/1     Running   0          79s     10.241.0.5    virtual-node-aci-linux              <none>           <none>
aci-helloworld-86c987d849-rh9tx   1/1     Running   0          7m49s   10.241.0.4    virtual-node-aci-linux              <none>           <none>
..

kubectl get no
NAME                                STATUS   ROLES    AGE   VERSION
aks-nodepool1-14565790-vmss000000   Ready    <none>   41m   v1.30.9
aks-nodepool1-14565790-vmss000001   Ready    <none>   41m   v1.30.9
virtual-node-aci-linux              Ready    agent    29m   v1.25.0-vk-azure-aci-1.6.2

Finally, the portal shows that the pods on the Virtual Node are running in the form of ACI in the AKS infrastructure resource group.

Because pods on a Virtual Node run as ACI from Azure's point of view, you can open them in the portal and use the ACI UI to view logs, connect to the console, and so on.

AKS Virtual Nodes also have some limitations. Some may be inherited from ACI's own limitations, and features such as DaemonSets and initContainers cannot be used.

Please review the limitations listed in the AKS Virtual Nodes documentation.

https://learn.microsoft.com/en-us/azure/aks/virtual-nodes#limitations

Clean up the resources to finish the exercise.

az group delete --name $RG

Wrap-up

We looked at how EKS and AKS can run pods without regular nodes.

EKS Fargate schedules through the Fargate scheduler by way of an admission controller, whereas AKS registers Virtual Nodes through the virtual kubelet and steers scheduling using the taint on that node.

That wraps up this post.


CKA Exam Review (February 18, 2025 Renewal)

한명 2025. 3. 22. 16:23


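This post is a review of the CKA (Certified Kubernetes Administrator) exam.

I first earned the CKA in 2018, so it expired long ago. While renewing my Kubernetes certifications recently, I retook the expired CKA and passed on March 19, so I am sharing my experience.

Basic information

The CKA is a hands-on exam in which you solve 16 problems in 2 hours.

The exam is taken in a Secure Browser environment. The exam UI looks like the image below: the questions are on the left, and on the right you can open windows for a browser and a terminal.

Source: https://docs.linuxfoundation.org/tc-docs/certification/tips-cka-and-ckad#adjusting-font-and-windows-in-the-examui

The exam UI is rather sluggish, and the built-in notepad in particular was terrible, so it is hard to make real use of it. Because of this, you need to be comfortable with vi.

There is nothing else unusual about the UI; just remember that copy and paste in the terminal are ctrl+shift+c and ctrl+shift+v.

A small screen puts you at a disadvantage, so take the exam on a monitor if possible. Dual monitors are not allowed, but using a single external monitor is. Note that to use a monitor you need a webcam (so the proctor can monitor the exam environment).

In the past you switched kubeconfig contexts to use a different context for each question; the current exam system has you ssh into a VM for each question. Each question provides the corresponding ssh command and links to the relevant topics.

Once you ssh into the node, kubectl and other tools are available. After solving a question, return to the original terminal and ssh again for the next one.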

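Content renewal

If you are preparing for the CKA, you probably already know that its topics and questions were renewed on February 18, 2025. (I should have taken the exam before the renewal; it caught me quite off guard.)

In the renewed CKA, about two or three of the previously known question types seemed to be the same. Everything else covered different topics and types. Easy questions of the "look up an object's information and write it down" kind have almost disappeared.

The renewed exam feels somewhat harder than the previous version.

In the feedback below, the common opinion among people who took the renewed exam after February 18 is that it has become much harder.

https://kodekloud.com/community/t/just-took-the-latest-version-of-the-cka-exam-failed-miserably-need-some-advice/474751

https://www.reddit.com/r/CKAExam/comments/1jbi4iw/discussion_of_the_updated_feb_18th_2025_cka_exam/

For the exam changes themselves, see the announcement below.

https://training.linuxfoundation.org/certified-kubernetes-administrator-cka-program-changes/

The video below also offers insight into the changes.

https://www.youtube.com/watch?v=fvvgM3QmKGo

Based on the Linux Foundation announcement, let's look at the topics added in each domain.

Overall the domains are the same as before; some topics were added in each domain, and even for unchanged topics the question types were completely revised.

Topics such as StorageClass and the Gateway API were added.

You need to understand application deployment technologies such as Helm and Kustomize, and the basic installation and configuration of CNI, CSI, and CRI. Topics such as CRDs were also added.

You need to understand installing Kubernetes on nodes, basic troubleshooting commands, and how control plane components are managed.

Because I cannot disclose the question types or details, I recommend studying with a focus on the added topics.

Of course, the CKA allows a retake, so one approach is to use the first attempt to get a good look at the topics, study more, and then take the second attempt. If you spend the first attempt stuck on questions you cannot solve and fail to budget your time, you may not even get to see the remaining topics.

I hope this helps with your exam.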


[6] EKS Security - EKS Authentication/Authorization and Assigning IAM Permissions to Pods

한명 2025. 3. 16. 00:54


This post covers security in EKS.

Kubernetes security also includes areas such as image security and node security, but here we look at two topics: how Kubernetes authentication and authorization are applied in EKS, and how workloads (pods) get secure access to AWS resources.

We first walk through the EKS authentication/authorization flow starting from the kubeconfig, and then look at how IAM is assigned to pods to grant workloads access to AWS resources.

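1. Kubernetes Authentication/Authorization

Kubernetes controls access to its API as follows.

A user or a pod (Service Account) reaches the Kubernetes API only after passing through Authentication -> Authorization -> Admission Control.

Source: https://kubernetes.io/docs/concepts/security/controlling-access/

Kubernetes itself does not store users or implement user authentication directly, so it can delegate user authentication to another authentication system.

The authorization stage then checks whether the authenticated principal has appropriate permissions on the Kubernetes resource. Finally, Admission Control allows additional steps such as validation or mutation of the request itself.

As the figure shows, each stage is designed like puzzle pieces, so that multiple forms of authentication, authorization, and admission control can be plugged in selectively.

The following section on EKS authentication/authorization explains how Kubernetes authentication and authorization proceed through IAM, which handles identity and access management in AWS. In other words, it looks at how a principal (user) that is valid in AWS passes Kubernetes authentication and authorization to use the cluster.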

2. EKS Authentication/Authorization

To understand how EKS authenticates and authorizes users, let's follow the execution flow of a kubectl command using the figure below.

Source: https://www.youtube.com/watch?v=bksogA-WXv8&t=600s

1) Run kubectl get node.

2) The aws eks get-token command defined in the kubeconfig requests an authentication token for the Amazon EKS cluster from the AWS STS regional endpoint.

3) The response from aws eks get-token is a token (base64-decoding it reveals a pre-signed URL that calls GetCallerIdentity on STS (Security Token Service)).

<< Up to this point, no authentication request has yet been sent to the EKS API endpoint >>

4) kubectl sends the request to the EKS cluster API endpoint with the pre-signed URL as the bearer token.

5) The API server sends a TokenReview request to the aws-iam-authenticator server (Webhook Token Authentication).

6) The aws-iam-authenticator server calls sts GetCallerIdentity.

7) AWS IAM verifies that the token is valid, completes authentication, and returns the ARN of the IAM user or role.

8) The Kubernetes principal is determined through aws-auth (ConfigMap), which maps IAM users/roles to Kubernetes groups.

9) The aws-iam-authenticator server (Webhook Token Authentication) returns the username and Kubernetes groups in a TokenReview object.

<< Authentication ends here >>

10) Kubernetes RBAC authorization is performed based on this information.

<< Authorization ends here >>

11) If authorized, the result of kubectl get node is returned.

So when kubectl runs, the user is authenticated through IAM and authorized according to Kubernetes RBAC.
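
Step 3 above mentions that the token decodes to a pre-signed URL. As a minimal sketch, assuming the token format used by aws-iam-authenticator (the prefix k8s-aws-v1. followed by the URL in URL-safe base64 with padding stripped), the decoding could look like this in Python. The helper name `decode_eks_token` and the sample URL are made up for illustration; the sample is not a real signed request.

```python
import base64

TOKEN_PREFIX = "k8s-aws-v1."


def decode_eks_token(token):
    """Strip the prefix and base64url-decode the rest into the pre-signed URL."""
    if not token.startswith(TOKEN_PREFIX):
        raise ValueError("not an EKS bearer token")
    payload = token[len(TOKEN_PREFIX):]
    # The padding is stripped in the token, so restore it before decoding.
    payload += "=" * (-len(payload) % 4)
    return base64.urlsafe_b64decode(payload).decode("utf-8")


# Build a fake token from a made-up, unsigned URL just to exercise the decoder.
fake_url = (
    "https://sts.ap-northeast-2.amazonaws.com/"
    "?Action=GetCallerIdentity&Version=2011-06-15"
)
fake_token = TOKEN_PREFIX + base64.urlsafe_b64encode(fake_url.encode()).decode().rstrip("=")

print(decode_eks_token(fake_token))
```

With a real token, the value to decode would be the status.token field in the JSON printed by aws eks get-token. The decoded URL carries signature parameters, so treat it as a credential and avoid pasting it anywhere public.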

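To summarize, the flow can be divided into four steps.

1) Running kubectl requests an authentication token for the EKS cluster using AWS credentials.

2) Authentication is performed through IAM, following Webhook Token Authentication.

3) After authentication, the returned ARN is checked against its mapping to Kubernetes groups. There are two ways to do this:

aws-auth (ConfigMap) method (to be deprecated)
EKS API method

4) Authorization is performed through Kubernetes RBAC based on the authenticated IAM identity.

For reference, aws-auth (the ConfigMap) mentioned above holds the mapping between IAM role/user ARNs and Kubernetes groups. Users map IAM users to cluster groups with eksctl create iamidentitymapping, and this is reflected in the ConfigMap.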
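
Conceptually, the aws-auth lookup is a table from an IAM ARN to a Kubernetes username and groups. The Python sketch below illustrates that idea only. `MAP_ROLES`, `MAP_USERS`, and `resolve_identity` are hypothetical names, the account ID 111122223333 and the role name are placeholders, and the real aws-iam-authenticator does more than this (for example, it expands the username template for nodes).

```python
# Hypothetical, already-parsed contents of the aws-auth ConfigMap.
MAP_ROLES = [
    {
        "rolearn": "arn:aws:iam::111122223333:role/example-node-role",
        "username": "system:node:{{EC2PrivateDNSName}}",
        "groups": ["system:bootstrappers", "system:nodes"],
    }
]
MAP_USERS = [
    {
        "userarn": "arn:aws:iam::111122223333:user/testuser",
        "username": "testuser",
        "groups": ["system:masters"],
    }
]


def resolve_identity(arn):
    """Return (username, groups) for a mapped ARN, or None if it is not mapped."""
    for entry in MAP_USERS:
        if entry["userarn"] == arn:
            return entry["username"], entry["groups"]
    for entry in MAP_ROLES:
        if entry["rolearn"] == arn:
            return entry["username"], entry["groups"]
    # Authenticated by IAM but unknown to the cluster: the request is rejected.
    return None


print(resolve_identity("arn:aws:iam::111122223333:user/testuser"))
print(resolve_identity("arn:aws:iam::111122223333:user/someone-else"))
```

The second lookup returns None, which corresponds to an IAM identity that is valid in AWS but has no mapping in the cluster, regardless of the IAM policies attached to it.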

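However, this approach exposes the ConfigMap inside Kubernetes, and it has several problems; for example, a bad edit can cause issues for the cluster. For this reason the EKS API method was introduced recently.

With the EKS API method the ConfigMap goes away, and IAM roles/users are mapped to access policies through access entries managed via the EKS API. After IAM authentication completes, the returned ARN is checked against the access entry mappings in the EKS API, and Kubernetes RBAC authorization follows.

Source: https://aws.amazon.com/ko/blogs/containers/a-deep-dive-into-simplified-amazon-eks-access-management-controls/

In the web console, the Access tab of the EKS cluster shows the authentication mode. A cluster created with the defaults has EKS API and ConfigMap selected. In this mode, if an identity is configured in both the EKS API and the ConfigMap, the EKS API takes precedence.

Manage access on the same page provides an interface for changing the mode, as shown below.

Now let's verify the behavior in a real EKS environment.

Cluster access: ConfigMap

To see how this option is configured, we create a testuser as shown below and grant it permission to access the EKS cluster.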

# Create the testuser IAM user
aws iam create-user --user-name testuser

# Grant the user programmatic access
aws iam create-access-key --user-name testuser
{
    "AccessKey": {
        "UserName": "testuser",
        "AccessKeyId": "AKIA5ILF2##",
        "Status": "Active",
        "SecretAccessKey": "TxhhwsU8##",
        "CreateDate": "2023-05-23T07:40:09+00:00"
    }
}
# Attach a policy to testuser
aws iam attach-user-policy --policy-arn arn:aws:iam::aws:policy/AdministratorAccess --user-name testuser

# The steps below run on a VM where aws configure has not been run yet, so that kubectl is set up from scratch.
# Configure the testuser credentials
aws configure
AWS Access Key ID [None]: ...
AWS Secret Access Key [None]: ....
Default region name [None]: ap-northeast-2

# Check get-caller-identity
aws sts get-caller-identity --query Arn
"arn:aws:iam::911283464785:user/testuser"

# Get a kubeconfig for testuser
CLUSTER_NAME=myeks
aws eks update-kubeconfig --name $CLUSTER_NAME --user-alias testuser

# Try kubectl
kubectl get node
E0315 22:46:29.480897    1795 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: the server has asked for the client to provide credentials"
E0315 22:46:30.514466    1795 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: the server has asked for the client to provide credentials"
E0315 22:46:31.629986    1795 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: the server has asked for the client to provide credentials"
E0315 22:46:32.658748    1795 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: the server has asked for the client to provide credentials"
E0315 22:46:33.649009    1795 memcache.go:265] "Unhandled Error" err="couldn't get current server API group list: the server has asked for the client to provide credentials"
error: You must be logged in to the server (the server has asked for the client to provide credentials)

# Inspect the kubeconfig
cat ~/.kube/config
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: ~
    server: ~
contexts:
- context:
    cluster: arn:aws:eks:ap-northeast-2:xx:cluster/myeks
    user: testuser
  name: testuser
current-context: testuser
kind: Config
preferences: {}
users:
- name: testuser
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1beta1
      args:
      - --region
      - ap-northeast-2
      - eks
      - get-token
      - --cluster-name
      - myeks
      - --output
      - json
      command: aws

kubectl get cm -n kube-system aws-auth -o yaml

apiVersion: v1
data:
  mapRoles: |
    - groups:
      - system:bootstrappers
      - system:nodes
      rolearn: arn:aws:iam::xx:role/eksctl-myeks-nodegroup-ng1-NodeInstanceRole-e7CGUnBoQC96
      username: system:node:{{EC2PrivateDNSName}}
kind: ConfigMap
metadata:
  creationTimestamp: "2025-03-15T13:10:23Z"
  name: aws-auth
  namespace: kube-system
  resourceVersion: "2028"
  uid: 13151df2-4cd2-4fc2-92dc-2b0289a1be55

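testuser has the AdministratorAccess policy, but it has no permission to authenticate to the EKS API server. (The admin account that created the cluster is not in the ConfigMap either, but it has access because it is registered as an access entry in the EKS API.)

Creating the iamidentitymapping below updates aws-auth (the ConfigMap).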

# Creates a mapping from IAM role or user to Kubernetes user and groups
eksctl get iamidentitymapping --cluster $CLUSTER_NAME
ARN                                                                                     USERNAME                                GROUPS                          ACCOUNT
arn:aws:iam::xx:role/eksctl-myeks-nodegroup-ng1-NodeInstanceRole-e7CGUnBoQC96 system:node:{{EC2PrivateDNSName}}       system:bootstrappers,system:nodes

# Create the IAM identity mapping
ACCOUNT_ID=$(aws sts get-caller-identity --query 'Account' --output text)
eksctl create iamidentitymapping --cluster $CLUSTER_NAME --username testuser --group system:masters --arn arn:aws:iam::$ACCOUNT_ID:user/testuser

# Verify
eksctl get iamidentitymapping --cluster $CLUSTER_NAME
ARN                                                                                     USERNAME                                GROUPS                          ACCOUNT
arn:aws:iam::xx:role/eksctl-myeks-nodegroup-ng1-NodeInstanceRole-e7CGUnBoQC96 system:node:{{EC2PrivateDNSName}}       system:bootstrappers,system:nodes
arn:aws:iam::xx:user/testuser                                                 testuser                                system:masters

kubectl get cm -n kube-system aws-auth -o yaml
...
  mapUsers: |
    - groups:
      - system:masters
      userarn: arn:aws:iam::xx:user/testuser
      username: testuser
...

Note that kubectl does not start working immediately after the IAM identity mapping is created; it can take some time to take effect.

Running it again afterwards, kubectl against EKS finally succeeds.

# Try again
kubectl get node 
NAME                                               STATUS   ROLES    AGE   VERSION
ip-192-168-1-41.ap-northeast-2.compute.internal    Ready    <none>   64m   v1.31.5-eks-5d632ec
ip-192-168-2-79.ap-northeast-2.compute.internal    Ready    <none>   65m   v1.31.5-eks-5d632ec
ip-192-168-3-202.ap-northeast-2.compute.internal   Ready    <none>   65m   v1.31.5-eks-5d632ec

To finish this exercise, delete the iamidentitymapping in preparation for the next one.

# Delete the testuser IAM mapping
eksctl delete iamidentitymapping --cluster $CLUSTER_NAME --arn  arn:aws:iam::$ACCOUNT_ID:user/testuser

# Get IAM identity mapping(s)
eksctl get iamidentitymapping --cluster $CLUSTER_NAME
kubectl get cm -n kube-system aws-auth -o yaml

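Cluster access: EKS API

In the web console, EKS > Access > IAM access entries shows the permissions currently assigned.

With the current EKS API and ConfigMap mode, the administrator account that created the cluster has already been assigned AmazonEKSClusterAdminPolicy.

First, switch to EKS API access mode with the command below. Note that once you change this option, you cannot revert to the previous one.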

# EKS API 액세스모드로 변경
aws eks update-cluster-config --name $CLUSTER_NAME --access-config authenticationMode=API

The change is also visible in the web console.
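
The same check can be done from the CLI. The sketch below wraps aws eks describe-cluster in a helper; the function name eks_auth_mode is an assumption made for this example, while the query path follows the accessConfig field returned by describe-cluster.

```shell
# Print the cluster's authentication mode:
# CONFIG_MAP, API_AND_CONFIG_MAP, or API.
eks_auth_mode() {
  aws eks describe-cluster --name "$1" \
    --query 'cluster.accessConfig.authenticationMode' --output text
}

# Usage (requires AWS credentials):
# eks_auth_mode "$CLUSTER_NAME"
```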

For reference, the document below lists the access policies created for EKS and the permissions each one grants. From the CLI, the currently available policies can be listed with aws eks list-access-policies.

https://docs.aws.amazon.com/eks/latest/userguide/access-policy-permissions.html

The access entries that currently exist can be checked as follows.

# List the current access entries
aws eks list-access-entries --cluster-name $CLUSTER_NAME | jq
{
  "accessEntries": [
    "arn:aws:iam::xx:role/aws-service-role/eks.amazonaws.com/AWSServiceRoleForAmazonEKS",
    "arn:aws:iam::xx:role/eksctl-myeks-nodegroup-ng1-NodeInstanceRole-e7CGUnBoQC96",
    "arn:aws:iam::xx:user/eksadmin"
  ]
}

# Check the access policy associated with the admin account -> AmazonEKSClusterAdminPolicy
export ACCOUNT_ID=$(aws sts get-caller-identity --query 'Account' --output text)
aws eks list-associated-access-policies --cluster-name $CLUSTER_NAME --principal-arn arn:aws:iam::$ACCOUNT_ID:user/eksadmin | jq # Linux
{
    "associatedAccessPolicies": [
        {
            "policyArn": "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy",
            "accessScope": {
                "type": "cluster",
                "namespaces": []
            },
            "associatedAt": "2025-03-15T21:56:26.361000+09:00",
            "modifiedAt": "2025-03-15T21:56:26.361000+09:00"
        }
    ],
    "clusterName": "myeks",
    "principalArn": "arn:aws:iam::xx:user/eksadmin"
}

Next, create an access entry for the testuser created earlier and associate an access policy with it.

# Create an access entry for testuser
aws eks create-access-entry --cluster-name $CLUSTER_NAME --principal-arn arn:aws:iam::$ACCOUNT_ID:user/testuser
aws eks list-access-entries --cluster-name $CLUSTER_NAME | jq -r '.accessEntries[]'

# Associate AmazonEKSClusterAdminPolicy with testuser
aws eks associate-access-policy --cluster-name $CLUSTER_NAME --principal-arn arn:aws:iam::$ACCOUNT_ID:user/testuser \
  --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy --access-scope type=cluster

# Check the associated access policy
aws eks list-associated-access-policies --cluster-name $CLUSTER_NAME --principal-arn arn:aws:iam::$ACCOUNT_ID:user/testuser
{
    "associatedAccessPolicies": [
        {
            "policyArn": "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy",
            "accessScope": {
                "type": "cluster",
                "namespaces": []
            },
            "associatedAt": "2025-03-15T23:30:17.290000+09:00",
            "modifiedAt": "2025-03-15T23:30:17.290000+09:00"
        }
    ],
    "clusterName": "myeks",
    "principalArn": "arn:aws:iam::xx:user/testuser"
}

With the access entry in place, kubectl also works for testuser through the EKS API.

# Try kubectl
kubectl get node 
NAME                                               STATUS   ROLES    AGE   VERSION
ip-192-168-1-41.ap-northeast-2.compute.internal    Ready    <none>   79m   v1.31.5-eks-5d632ec
ip-192-168-2-79.ap-northeast-2.compute.internal    Ready    <none>   79m   v1.31.5-eks-5d632ec
ip-192-168-3-202.ap-northeast-2.compute.internal   Ready    <none>   79m   v1.31.5-eks-5d632ec

# This is allowed because the associated policy is currently AmazonEKSClusterAdminPolicy
kubectl auth can-i delete pods --all-namespaces
yes

# Nothing is written to the aws-auth ConfigMap.
kubectl get cm -n kube-system aws-auth -o yaml
...
  mapUsers: |
    []
...
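
The example above uses --access-scope type=cluster. An access policy can also be limited to specific namespaces. The sketch below assumes a namespace named by the caller and uses the AmazonEKSViewPolicy access policy; the helper name grant_namespace_view is made up for this example and was not part of the original exercise.

```shell
# Associate AmazonEKSViewPolicy with a principal, limited to one namespace.
grant_namespace_view() {
  local cluster=$1 principal_arn=$2 namespace=$3
  aws eks associate-access-policy --cluster-name "$cluster" \
    --principal-arn "$principal_arn" \
    --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSViewPolicy \
    --access-scope "type=namespace,namespaces=$namespace"
}

# Usage (requires AWS credentials and an existing access entry):
# grant_namespace_view "$CLUSTER_NAME" "arn:aws:iam::$ACCOUNT_ID:user/testuser" default
```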

An access entry can also be mapped directly to Kubernetes groups. First remove the access entry created above, then create one that maps to a Kubernetes group.

# Remove the existing testuser access entry
aws eks delete-access-entry --cluster-name $CLUSTER_NAME --principal-arn arn:aws:iam::$ACCOUNT_ID:user/testuser
aws eks list-access-entries --cluster-name $CLUSTER_NAME | jq -r '.accessEntries[]'

# Verify
(testuser:N/A) [root@operator-host-2 ~]# kubectl get no
error: You must be logged in to the server (Unauthorized)

# Create the ClusterRole
cat <<EoF> ~/pod-viewer-role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: pod-viewer-role
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["list", "get", "watch"]
EoF

kubectl apply -f ~/pod-viewer-role.yaml

# Create the ClusterRoleBinding
kubectl create clusterrolebinding viewer-role-binding --clusterrole=pod-viewer-role --group=pod-viewer

# Create the access entry (with the --kubernetes-groups option added)
aws eks create-access-entry --cluster-name $CLUSTER_NAME --principal-arn arn:aws:iam::$ACCOUNT_ID:user/testuser --kubernetes-groups pod-viewer

# The group mapping does not show up among the associated access policies.
aws eks list-associated-access-policies --cluster-name $CLUSTER_NAME --principal-arn arn:aws:iam::$ACCOUNT_ID:user/testuser

{
    "associatedAccessPolicies": [],
    "clusterName": "myeks",
    "principalArn": "arn:aws:iam::xx:user/testuser"
}

aws eks describe-access-entry --cluster-name $CLUSTER_NAME --principal-arn arn:aws:iam::$ACCOUNT_ID:user/testuser | jq
{
  "accessEntry": {
    "clusterName": "myeks",
    "principalArn": "arn:aws:iam::xx:user/testuser",
    "kubernetesGroups": [
      "pod-viewer"
    ],
    "accessEntryArn": "arn:aws:eks:ap-northeast-2:xx:access-entry/myeks/user/xx/testuser/4ccacd1e-2a2e-fd71-93e2-94ead13e95e3",
    "createdAt": "2025-03-15T23:36:39.977000+09:00",
    "modifiedAt": "2025-03-15T23:36:39.977000+09:00",
    "tags": {},
    "username": "arn:aws:iam::xx:user/testuser",
    "type": "STANDARD"
  }
}

Verify the resulting permissions as follows.

# Try kubectl (listing nodes is denied, listing pods is allowed)
(testuser:N/A) [root@operator-host-2 ~]# kubectl get no
Error from server (Forbidden): nodes is forbidden: User "arn:aws:iam::xx:user/testuser" cannot list resource "nodes" in API group "" at the cluster scope
(testuser:N/A) [root@operator-host-2 ~]# kubectl get po
No resources found in default namespace.

# Check with can-i
kubectl auth can-i get pods --all-namespaces
yes
kubectl auth can-i delete pods --all-namespaces
no

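This concludes the hands-on exercise for EKS authentication and authorization.

3. Authentication and Authorization in AKS

Azure has an identity and access management service called Microsoft Entra ID (formerly Azure AD, Azure Active Directory).

AKS authentication can also use Entra ID, and combined with Azure RBAC it offers several options.

In the Azure portal, Settings>Security configuration of the AKS resource shows the authentication and authorization modes that AKS supports.

Frankly, no document explains this classification properly, and the official document below describes each term in a scattered way, so it is not easy to follow.

https://learn.microsoft.com/en-us/azure/aks/concepts-identity

It is a hard topic to explain as well, so I will go through it step by step.

Background 1

Microsoft Entra ID can be understood as the identity and authentication system that spans Azure and M365.

Azure RBAC, put simply, is how permissions are granted at the Azure level. Role assignments through Azure RBAC are made from the Access control (IAM) menu at each level (management group, subscription, resource group, resource).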

[Screenshot: the Access control (IAM) page in the Azure Portal]

Source: https://learn.microsoft.com/ko-kr/azure/role-based-access-control/rbac-and-directory-admin-roles

From Azure's point of view, you can think of this simply as assigning a Microsoft Entra ID principal (user, application) together with a Role in the Access control (IAM) of each Azure resource.

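Background 2

AWS and Azure differ in IAM terminology and perspective.

An AWS Role is a principal, similar to a user. In Azure, such a principal is called a Service Principal (SP) or a Managed Identity (MI).

An Azure Role is a set of permissions (actions). AWS calls this concept a Policy.

In AWS IAM, a Policy is attached to a user or Role. AWS IAM implements RBAC from the principal's perspective (resources and permissions are assigned to the principal).

In Azure RBAC, a user or principal (SP, MI) is mapped to a Role on the target (resource). Azure implements RBAC from the resource's perspective (users and permissions are assigned on the resource).

For example, to give a particular testuser administrator rights over a virtual machine, AWS and Azure work as follows.

In AWS, you create a new Policy that selects the virtual machine and grants administrator permissions, then attach that Policy to testuser.
In Azure, you map testuser to an administrator Role on the virtual machine itself.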

Background 3

Azure RBAC for AKS distinguishes Roles for the AKS resource from Roles for Kubernetes.

A Role for the AKS resource grants permissions for CRUD on AKS as an Azure resource (changing cluster settings, scaling node pools, and so on). There is also a separate permission for obtaining a kubeconfig through az aks get-credentials.

A Role for Kubernetes covers CRUD on resources inside Kubernetes (creating a Deployment, reading a ConfigMap, and so on). In the Microsoft Entra ID authentication with Azure RBAC mode, these Kubernetes permissions can be granted through Azure RBAC.

Background 4

Next is the concept of local accounts. When they are enabled, a local account equivalent to admin exists by default. It can exist even when Microsoft Entra ID is used, and passing the --admin flag to az aks get-credentials returns the kubeconfig admin credentials. This lets an administrator reach Kubernetes without Entra ID authentication, but because that can be a security weakness, local accounts can be disabled.

Reference: https://learn.microsoft.com/en-us/azure/aks/manage-local-accounts-managed-azure-ad
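
The difference between the user and admin kubeconfig, and the switch that disables local accounts, can be sketched as follows. The helper names and the resource group and cluster arguments are assumptions for illustration; the az aks get-credentials and az aks update --disable-local-accounts commands themselves come from the Azure CLI.

```shell
# Fetch the regular (user) kubeconfig, or the local admin kubeconfig
# when "admin" is passed as the third argument.
aks_get_kubeconfig() {
  local rg=$1 cluster=$2 mode=${3:-user}
  if [ "$mode" = "admin" ]; then
    az aks get-credentials --resource-group "$rg" --name "$cluster" --admin
  else
    az aks get-credentials --resource-group "$rg" --name "$cluster"
  fi
}

# Disable local accounts so the --admin kubeconfig can no longer be issued.
aks_disable_local_accounts() {
  az aks update --resource-group "$1" --name "$2" --disable-local-accounts
}

# Usage (requires an Azure login):
# aks_get_kubeconfig myResourceGroup myAKSCluster admin
# aks_disable_local_accounts myResourceGroup myAKSCluster
```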

With this background in place, each mode is explained below.

Local accounts with Kubernetes RBAC

This mode does not integrate authentication with Microsoft Entra ID.

In this configuration, a user or group granted the Azure Kubernetes Service Cluster User Role can obtain a kubeconfig through az aks get-credentials. This kubeconfig carries Kubernetes admin privileges. Authorization can then be configured through Kubernetes RBAC. Because the user credentials already have admin privileges in this configuration, they are no different from what the --admin flag returns.

This mode has no Kubernetes authentication through Microsoft Entra ID. Users can be distinguished only by whether they are granted the Role that allows obtaining a kubeconfig, and every kubeconfig obtained has the same admin privileges on Kubernetes.

Microsoft Entra ID authentication with Kubernetes RBAC

This configuration integrates authentication with Microsoft Entra ID, and Entra ID users or groups can be used as subjects in Kubernetes RBAC.

When Microsoft Entra ID authentication with Kubernetes RBAC is selected, you can specify the Entra ID group that is bound to the Kubernetes admin role through a ClusterRoleBinding. Users in that group receive Kubernetes admin privileges.

The intended scenario is that the admin group manages Kubernetes roles and bindings, and the remaining users or groups are granted permissions through Kubernetes RBAC.

A user with the Azure Kubernetes Service Cluster Admin Role can obtain a kubeconfig with Kubernetes admin privileges. This is why the portal also offers an option to disable local accounts.

In this configuration as well, only users or groups granted the Azure Kubernetes Service Cluster User Role can obtain a kubeconfig through az aks get-credentials.

The kubeconfig obtained is configured to authenticate through Microsoft Entra ID, and in Kubernetes RBAC mode a RoleBinding can name a Microsoft Entra ID user (UPN, User Principal Name) or group (Object ID).

kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: dev-user-access
  namespace: dev
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: dev-user-full-access
subjects:
- kind: Group # for an individual user, use kind: User
  namespace: dev
  name: groupObjectId # for an individual user, enter the user principal name (UPN)
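
The RoleBinding above refers to a Role named dev-user-full-access that is not shown. A minimal sketch of such a Role follows; the rule set (core and apps API groups, all resources and verbs) and the file name are assumptions for illustration, so adjust them to what the dev group should actually be allowed to do.

```shell
# Write a Role named dev-user-full-access for the dev namespace to a file.
# The rules below are an assumed example, not taken from the original post.
cat <<'EOF' > role-dev-namespace.yaml
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: dev-user-full-access
  namespace: dev
rules:
- apiGroups: ["", "apps"]
  resources: ["*"]
  verbs: ["*"]
EOF

# Apply it before the RoleBinding (requires cluster access):
# kubectl apply -f role-dev-namespace.yaml
```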

This approach seems similar to the ConfigMap-based approach in EKS (mapping an IAM user to a Kubernetes group). That said, EKS API access entries can also map an IAM user to a Kubernetes group.

Microsoft Entra ID authentication with Azure RBAC

This configuration integrates authentication with Microsoft Entra ID and uses Azure RBAC for Kubernetes authorization.

In this configuration as well, only users or groups granted the Azure Kubernetes Service Cluster User Role can obtain a kubeconfig through az aks get-credentials.

The kubeconfig obtained is configured to authenticate through Microsoft Entra ID, and Kubernetes authorization is configured through Azure RBAC.

You can use the built-in roles listed below or create a separate custom role.

https://learn.microsoft.com/en-us/azure/aks/manage-azure-rbac?tabs=azure-cli#aks-built-in-roles

These roles are then assigned to users or groups through Azure RBAC. Because --scope can target a specific namespace, Azure RBAC effectively covers Kubernetes-level management as well.

# Assign a role on the AKS cluster
az role assignment create --role "Azure Kubernetes Service RBAC Admin" --assignee <AAD-ENTITY-ID> --scope $AKS_ID

# Assign a role on a specific namespace
az role assignment create --role "Azure Kubernetes Service RBAC Reader" --assignee <AAD-ENTITY-ID> --scope $AKS_ID/namespaces/<namespace-name>

For reference, a user with the Azure Kubernetes Service Cluster Admin Role can obtain a kubeconfig with Kubernetes admin privileges. For this reason, local accounts can be disabled.

With Microsoft Entra ID authentication with Azure RBAC, Kubernetes permissions are managed simply by assigning roles on the AKS resource, so every permission granted on AKS (both resource permissions and Kubernetes permissions) can be reviewed at the Azure level.

This approach seems similar to the EKS API approach in EKS (mapping an AWS IAM user to a predefined Policy). The difference is that EKS API access entries can also be mapped to Kubernetes groups.

AKS authentication and authorization summary

In summary, AKS requires that the permission to obtain a kubeconfig be granted to specific users or groups.

Without Microsoft Entra ID integration, everyone works with the admin kubeconfig.

Kubernetes authentication can be integrated with Microsoft Entra ID, and for authorization you can choose between Kubernetes RBAC and Azure RBAC.

Finally, once Microsoft Entra ID is integrated, local accounts can be disabled.

Compared with EKS, AKS has a separate permission for obtaining a kubeconfig, offers a local account mode that does not authenticate through Microsoft Entra ID, and can handle authorization through Azure RBAC.

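4. Pod Permissions in Kubernetes

In Kubernetes, a pod is given a ServiceAccount, and granting RBAC permissions to that ServiceAccount gives the pod access to Kubernetes resources. For example, if a pod is responsible for deployments and its ServiceAccount is allowed CRUD on deployments, the pod can create Deployments.

A pod accessing cloud resources rather than Kubernetes resources is a different matter, for example a pod listing S3 buckets or uploading a file. This is beyond what Kubernetes itself can authenticate and authorize; the pod must be authenticated and authorized by the cloud provider's identity and access management service.

The earlier section on EKS authentication and authorization looked at how a principal valid in AWS can use Kubernetes authentication and authorization. The topic here is the reverse: how a principal valid in Kubernetes can use AWS authentication and authorization.

EKS offers IRSA (IAM Roles for Service Accounts) and Pod Identity for this, and the next section, Assigning Pod Permissions in EKS, covers them in detail.

5. Assigning Pod Permissions in EKS

A pod that has been given no permissions of its own ends up with the node's permissions. On AWS, that means the IAM Role attached to the instance.

This can be checked as follows.

# Create the awscli pods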
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: awscli-pod
spec:
  replicas: 2
  selector:
    matchLabels:
      app: awscli-pod
  template:
    metadata:
      labels:
        app: awscli-pod
    spec:
      containers:
      - name: awscli-pod
        image: amazon/aws-cli
        command: ["tail"]
        args: ["-f", "/dev/null"]
      terminationGracePeriodSeconds: 0
EOF

# Confirm the pods were created
kubectl get pod -owide

# Store the pod names in variables
APODNAME1=$(kubectl get pod -l app=awscli-pod -o jsonpath="{.items[0].metadata.name}")
APODNAME2=$(kubectl get pod -l app=awscli-pod -o jsonpath="{.items[1].metadata.name}")
echo $APODNAME1, $APODNAME2

# The awscli pods report the ARN of the EC2 instance profile (IAM Role)
kubectl exec -it $APODNAME1 -- aws sts get-caller-identity --query Arn
"arn:aws:sts::xx:assumed-role/eksctl-myeks-nodegroup-ng1-NodeInstanceRole-e7CGUnBoQC96/i-0e6fd9f697b0a4c2f"
kubectl exec -it $APODNAME2 -- aws sts get-caller-identity --query Arn
"arn:aws:sts::xx:assumed-role/eksctl-myeks-nodegroup-ng1-NodeInstanceRole-e7CGUnBoQC96/i-04ed117980b8faf7f"

# Fails because this IAM Role does not have the permission
kubectl exec -it $APODNAME1 -- aws s3 ls
An error occurred (AccessDenied) when calling the ListBuckets operation: User: arn:aws:sts::xx:assumed-role/eksctl-myeks-nodegroup-ng1-NodeInstanceRole-e7CGUnBoQC96/i-0e6fd9f697b0a4c2f is not authorized to perform: s3:ListAllMyBuckets because no identity-based policy allows the s3:ListAllMyBuckets action
command terminated with exit code 254
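
The pod picks up the node role because, with no other credentials configured, the AWS CLI falls back to the EC2 instance metadata service (IMDS). The sketch below queries IMDSv2 directly. It assumes curl is available in the container and that the instance's metadata hop limit lets pods reach IMDS; the function name imds_role_name is made up for this example.

```shell
# Print the name of the IAM role exposed by the instance metadata service (IMDSv2).
imds_role_name() {
  local base=http://169.254.169.254/latest token
  # IMDSv2 requires a session token obtained with a PUT request.
  token=$(curl -s -X PUT "$base/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
  curl -s -H "X-aws-ec2-metadata-token: $token" \
    "$base/meta-data/iam/security-credentials/"
}

# Usage (from a pod or node with access to IMDS):
# imds_role_name
```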

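One might think of simply adding the permissions to the IAM Role attached to the instance, but then every pod running on the node receives the same permissions, which violates the principle of least privilege.

For reference, the ClusterConfig used to create the EKS cluster can specify IAM settings for a managed node group as follows.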

...
managedNodeGroups:
- amiFamily: AmazonLinux2023
  desiredCapacity: 3
  iam:
    withAddonPolicies:
      autoScaler: true
      certManager: true
      externalDNS: true
  instanceType: t3.medium
...

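In the web console, following the IAM Role attached to the instance shows the same policies.

For this reason, a way to give an IAM Role only to the pods that need it is required. Let's look at the two mechanisms EKS provides for this, IRSA and Pod Identity.

IRSA(IAM Roles for Service Accounts)

IRSA attaches an IAM Role with the required permissions to a ServiceAccount, and a pod using that ServiceAccount authenticates with AWS to access AWS resources. To make this work, the OIDC issuer issues a JWT, and IAM validates that the token was really issued through its trust relationship with the OIDC provider.

IRSA works through the following procedure.

Source: https://github.com/awskrug/security-group/blob/main/files/AWSKRUG_2024_02_EKS_ROLE_MANAGEMENT.pdf

Let's walk through it hands-on.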

# Create an iamserviceaccount - AWS IAM role bound to a Kubernetes service account
eksctl create iamserviceaccount \
  --name my-sa \
  --namespace default \
  --cluster $CLUSTER_NAME \
  --approve \
  --attach-policy-arn $(aws iam list-policies --query 'Policies[?PolicyName==`AmazonS3ReadOnlyAccess`].Arn' --output text)

2025-03-15 23:58:25 [ℹ]  1 existing iamserviceaccount(s) (kube-system/aws-load-balancer-controller) will be excluded
2025-03-15 23:58:25 [ℹ]  1 iamserviceaccount (default/my-sa) was included (based on the include/exclude rules)
2025-03-15 23:58:25 [!]  serviceaccounts that exist in Kubernetes will be excluded, use --override-existing-serviceaccounts to override
2025-03-15 23:58:25 [ℹ]  1 task: {
    2 sequential sub-tasks: {
        create IAM role for serviceaccount "default/my-sa",
        create serviceaccount "default/my-sa",
    } }
2025-03-15 23:58:25 [ℹ]  building iamserviceaccount stack "eksctl-myeks-addon-iamserviceaccount-default-my-sa"
2025-03-15 23:58:25 [ℹ]  deploying stack "eksctl-myeks-addon-iamserviceaccount-default-my-sa"
2025-03-15 23:58:25 [ℹ]  waiting for CloudFormation stack "eksctl-myeks-addon-iamserviceaccount-default-my-sa"
2025-03-15 23:58:56 [ℹ]  waiting for CloudFormation stack "eksctl-myeks-addon-iamserviceaccount-default-my-sa"
2025-03-15 23:58:56 [ℹ]  created serviceaccount "default/my-sa"

# Check the ServiceAccount
kubectl get sa
NAME      SECRETS   AGE
default   0         122m
my-sa     0         4m37s

kubectl describe sa my-sa
Name:                my-sa
Namespace:           default
Labels:              app.kubernetes.io/managed-by=eksctl
Annotations:         eks.amazonaws.com/role-arn: arn:aws:iam::xx:role/eksctl-myeks-addon-iamserviceaccount-default--Role1-MYPji4gGE3x2
Image pull secrets:  <none>
Mountable secrets:   <none>
Tokens:              <none>
Events:              <none>

A new ServiceAccount has been created in Kubernetes, with the role ARN set as an annotation.

Running the command above also triggers CloudFormation, which creates the IAM Role. The Role appears among the resources of the CloudFormation stack.

Create a pod that uses IRSA through the new ServiceAccount.

# Create the pod
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: eks-iam-test3
spec:
  serviceAccountName: my-sa
  containers:
    - name: my-aws-cli
      image: amazon/aws-cli:latest
      command: ['sleep', '36000']
  restartPolicy: Never
  terminationGracePeriodSeconds: 0
EOF

# List the iamserviceaccounts
eksctl get iamserviceaccount --cluster $CLUSTER_NAME
NAMESPACE       NAME                            ROLE ARN
default         my-sa                           arn:aws:iam::xx:role/eksctl-myeks-addon-iamserviceaccount-default--Role1-MYPji4gGE3x2
kube-system     aws-load-balancer-controller    arn:aws:iam::xx:role/eksctl-myeks-addon-iamserviceaccount-kube-sys-Role1-oyUqvqXumqCT

# Confirm which identity the aws cli uses inside the pod
kubectl exec -it eks-iam-test3 -- aws sts get-caller-identity --query Arn
"arn:aws:sts::xx:assumed-role/eksctl-myeks-addon-iamserviceaccount-default--Role1-MYPji4gGE3x2/botocore-session-1742051277"

# Allowed by the attached Policy (no error)
kubectl exec -it eks-iam-test3 -- aws s3 ls

# Not allowed by the attached Policy
kubectl exec -it eks-iam-test3 -- aws ec2 describe-instances --region ap-northeast-2
An error occurred (UnauthorizedOperation) when calling the DescribeInstances operation: You are not authorized to perform this operation. User: arn:aws:sts::xx:assumed-role/eksctl-myeks-addon-iamserviceaccount-default--Role1-MYPji4gGE3x2/botocore-session-1742051277 is not authorized to perform: ec2:DescribeInstances because no identity-based policy allows the ec2:DescribeInstances action
command terminated with exit code 254

The pod spec only names the ServiceAccount, but the additional settings were injected by a mutating webhook of the admission controller.

# When a pod uses this SA, the mutating webhook adds env vars and a volume, injecting the AWS IAM role into the pod automatically
kubectl get mutatingwebhookconfigurations pod-identity-webhook -o yaml
...
webhooks:
- admissionReviewVersions:
  - v1beta1
  clientConfig:
    caBundle: xxx
    url: https://127.0.0.1:23443/mutate
  failurePolicy: Ignore
  matchPolicy: Equivalent
  name: iam-for-pods.amazonaws.com
  namespaceSelector: {}
  objectSelector:
    matchExpressions:
    - key: eks.amazonaws.com/skip-pod-identity-webhook
      operator: DoesNotExist
  reinvocationPolicy: IfNeeded
  rules:
  - apiGroups:
    - ""
    apiVersions:
    - v1
    operations:
    - CREATE
    resources:
    - pods
    scope: '*'
  sideEffects: None
  timeoutSeconds: 10
...

# Through the mutating webhook, Pod Identity Webhook adds the env entries below and one volume
kubectl get pod eks-iam-test3
kubectl get pod eks-iam-test3 -o yaml
...
    env:
    - name: AWS_STS_REGIONAL_ENDPOINTS
      value: regional
    - name: AWS_DEFAULT_REGION
      value: ap-northeast-2
    - name: AWS_REGION
      value: ap-northeast-2
    - name: AWS_ROLE_ARN
      value: arn:aws:iam::xx:role/eksctl-myeks-addon-iamserviceaccount-default--Role1-MYPji4gGE3x2
    - name: AWS_WEB_IDENTITY_TOKEN_FILE
      value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
...
    volumeMounts: 
    - mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
      name: aws-iam-token
      readOnly: true
...
  volumes:
  - name: aws-iam-token
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          audience: sts.amazonaws.com
          expirationSeconds: 86400
          path: token
...
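
The injected AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE variables are what the AWS CLI and SDKs use to call sts:AssumeRoleWithWebIdentity on the pod's behalf. As a rough sketch, the same exchange can be made by hand from inside the pod; the function name and the default session name below are assumptions for illustration.

```shell
# Exchange the projected ServiceAccount token for temporary credentials
# and print when those credentials expire.
irsa_assume_role() {
  aws sts assume-role-with-web-identity \
    --role-arn "$AWS_ROLE_ARN" \
    --role-session-name "${1:-manual-irsa-check}" \
    --web-identity-token "file://$AWS_WEB_IDENTITY_TOKEN_FILE" \
    --query 'Credentials.Expiration' --output text
}

# Usage (inside the eks-iam-test3 pod, where both variables are set):
# irsa_assume_role
```

If the Role's trust relationship does not match the token's issuer and subject, this call is where the failure shows up.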

Let's look at the mounted token.

# Print the token
kubectl exec -it eks-iam-test3 -- cat /var/run/secrets/eks.amazonaws.com/serviceaccount/token ; echo

Decoding it as a JWT shows the following information.

{
  "aud": [
    "sts.amazonaws.com"
  ],
  "exp": 1742137620,
  "iat": 1742051220,
  "iss": "https://oidc.eks.ap-northeast-2.amazonaws.com/id/EF882B...",
  "jti": "585eab74-0dd1-4047-a8d5-2181c3db9c13",
  "kubernetes.io": {
    "namespace": "default",
    "node": {
      "name": "ip-192-168-1-41.ap-northeast-2.compute.internal",
      "uid": "fb4be118-8152-452a-a0aa-eaff394022e2"
    },
    "pod": {
      "name": "eks-iam-test3",
      "uid": "784ef7a7-0acd-4394-bcd7-c4f71f5c101f"
    },
    "serviceaccount": {
      "name": "my-sa",
      "uid": "bd180c18-c38d-4974-8f34-8d9b2f7b37c8"
    }
  },
  "nbf": 1742051220,
  "sub": "system:serviceaccount:default:my-sa"
}

The iss claim of the JWT matches the OpenID Connect provider URL shown for the cluster in the EKS web console.
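
The token can also be decoded locally without pasting it into a web tool. The helper below is a minimal sketch that only decodes the payload and does not verify the signature; the function name jwt_payload is an assumption, and it relies on a base64 that accepts -d (GNU coreutils or a recent macOS).

```shell
# Decode the payload (second segment) of a JWT without verifying its signature.
jwt_payload() {
  local payload pad
  # base64url uses '-' and '_' where standard base64 uses '+' and '/'.
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # base64url omits padding; restore it so base64 -d accepts the input.
  pad=$(( (4 - ${#payload} % 4) % 4 ))
  while [ "$pad" -gt 0 ]; do
    payload="${payload}="
    pad=$((pad - 1))
  done
  printf '%s' "$payload" | base64 -d
}

# Usage (requires cluster access and jq): compare the token issuer with the cluster's OIDC issuer.
# jwt_payload "$(kubectl exec eks-iam-test3 -- cat /var/run/secrets/eks.amazonaws.com/serviceaccount/token)" | jq -r .iss
# aws eks describe-cluster --name "$CLUSTER_NAME" --query cluster.identity.oidc.issuer --output text
```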

Wrap up the exercise by deleting the IRSA-related resources.

# Delete the pods and remove IRSA after the exercise
kubectl delete deploy awscli-pod
kubectl delete pod eks-iam-test3
eksctl delete iamserviceaccount --cluster $CLUSTER_NAME --name my-sa --namespace default

# Verify
eksctl get iamserviceaccount --cluster $CLUSTER_NAME
kubectl get sa

Because of its management complexity, and because it can be insecure when the ServiceAccount is not specified precisely, IRSA is now regarded as the classic approach, and Pod Identity was introduced afterwards.

Pod Identity

IRSA was introduced in 2019, while Pod Identity arrived relatively recently, in 2023, with many improvements in security and usability.

With Pod Identity, credentials are issued through the EKS Pod Identity Agent, and authentication is handled by the EKS Auth API. See the flow in the post below.

Source: https://aws.amazon.com/ko/blogs/containers/amazon-eks-pod-identity-a-new-way-for-applications-on-eks-to-obtain-iam-credentials/

The exercise proceeds as follows.

Pod Identity can be installed as an add-on.

# Check the available Pod Identity agent versions
ADDON=eks-pod-identity-agent
aws eks describe-addon-versions \
    --addon-name $ADDON \
    --kubernetes-version 1.31 \
    --query "addons[].addonVersions[].[addonVersion, compatibilities[].defaultVersion]" \
    --output text
v1.3.5-eksbuild.2
False
v1.3.4-eksbuild.1
True
v1.3.2-eksbuild.2
False
v1.3.0-eksbuild.1
False
v1.2.0-eksbuild.1
False
v1.1.0-eksbuild.1
False
v1.0.0-eksbuild.1
False

# Install
eksctl create addon --cluster $CLUSTER_NAME --name eks-pod-identity-agent --version 1.3.5

# Verify
eksctl get addon --cluster $CLUSTER_NAME

NAME                    VERSION                 STATUS          ISSUES  IAMROLE                                                                                 UPDATE AVAILABLE CONFIGURATION VALUES            POD IDENTITY ASSOCIATION ROLES
aws-ebs-csi-driver      v1.40.1-eksbuild.1      ACTIVE          0       arn:aws:iam::xx:role/eksctl-myeks-addon-aws-ebs-csi-driver-Role1-15a6w33Xm4wR
coredns                 v1.11.4-eksbuild.2      ACTIVE          0
eks-pod-identity-agent  v1.3.5-eksbuild.2       CREATING        0
kube-proxy              v1.31.3-eksbuild.2      ACTIVE          0
metrics-server          v0.7.2-eksbuild.2       ACTIVE          0
vpc-cni                 v1.19.3-eksbuild.1      ACTIVE          0       arn:aws:iam::xx:role/eksctl-myeks-addon-vpc-cni-Role1-RS9uYpCia7T9            enableNetworkPolicy: "true"

# Installed as a DaemonSet
kubectl -n kube-system get daemonset eks-pod-identity-agent

NAME                     DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
eks-pod-identity-agent   3         3         3       3            3           <none>          33s

Create a Pod Identity Association as follows.

# Create the Pod Identity Association
eksctl create podidentityassociation \
--cluster $CLUSTER_NAME \
--namespace default \
--service-account-name s3-sa \
--role-name s3-eks-pod-identity-role \
--permission-policy-arns arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess \
--region ap-northeast-2

2025-03-16 00:22:39 [ℹ]  1 task: {
    2 sequential sub-tasks: {
        create IAM role for pod identity association for service account "default/s3-sa",
        create pod identity association for service account "default/s3-sa",
    } }
2025-03-16 00:22:39 [ℹ]  deploying stack "eksctl-myeks-podidentityrole-default-s3-sa"
2025-03-16 00:22:40 [ℹ]  waiting for CloudFormation stack "eksctl-myeks-podidentityrole-default-s3-sa"
2025-03-16 00:23:10 [ℹ]  waiting for CloudFormation stack "eksctl-myeks-podidentityrole-default-s3-sa"
2025-03-16 00:23:11 [ℹ]  created pod identity association for service account "s3-sa" in namespace "default"
2025-03-16 00:23:11 [ℹ]  all tasks were completed successfully

# Verify
kubectl get sa

NAME      SECRETS   AGE
default   0         142m

eksctl get podidentityassociation --cluster $CLUSTER_NAME
ASSOCIATION ARN                                                                                 NAMESPACE       SERVICE ACCOUNT NAME    IAM ROLE ARN            OWNER ARN
arn:aws:eks:ap-northeast-2:xx:podidentityassociation/myeks/a-8zp14caxh5ask7ed0        default         s3-sa                   arn:aws:iam::xx:role/s3-eks-pod-identity-role

aws eks list-pod-identity-associations --cluster-name $CLUSTER_NAME | jq
{
  "associations": [
    {
      "clusterName": "myeks",
      "namespace": "default",
      "serviceAccount": "s3-sa",
      "associationArn": "arn:aws:eks:ap-northeast-2:xx:podidentityassociation/myeks/a-8zp14caxh5ask7ed0",
      "associationId": "a-8zp14caxh5ask7ed0"
    }
  ]
}

eksctl create podidentityassociation also runs CloudFormation under the hood, but it does not create the ServiceAccount.

In the web console, the association is listed under Pod Identity associations. IRSA, by contrast, is not shown in the web console.

A pod uses Pod Identity simply by naming the associated ServiceAccount.

# Create the ServiceAccount and the pod
kubectl create sa s3-sa

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: eks-pod-identity
spec:
  serviceAccountName: s3-sa
  containers:
    - name: my-aws-cli
      image: amazon/aws-cli:latest
      command: ['sleep', '36000']
  restartPolicy: Never
  terminationGracePeriodSeconds: 0
EOF

# Inspect the pod
kubectl get pod eks-pod-identity -o yaml 
...
    env:
    - name: AWS_STS_REGIONAL_ENDPOINTS
      value: regional
    - name: AWS_DEFAULT_REGION
      value: ap-northeast-2
    - name: AWS_REGION
      value: ap-northeast-2
    - name: AWS_CONTAINER_CREDENTIALS_FULL_URI
      value: http://169.254.170.23/v1/credentials
    - name: AWS_CONTAINER_AUTHORIZATION_TOKEN_FILE
      value: /var/run/secrets/pods.eks.amazonaws.com/serviceaccount/eks-pod-identity-token
...
    - mountPath: /var/run/secrets/pods.eks.amazonaws.com/serviceaccount
      name: eks-pod-identity-token
      readOnly: true
...
  volumes:
  - name: eks-pod-identity-token
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          audience: pods.eks.amazonaws.com
          expirationSeconds: 86400
          path: eks-pod-identity-token
...

# Check the identity obtained through Pod Identity
kubectl exec -it eks-pod-identity -- aws sts get-caller-identity --query Arn
"arn:aws:sts::xx:assumed-role/s3-eks-pod-identity-role/eks-myeks-eks-pod-id-0382fb7d-1b2b-45c2-84bb-d5b123292589"

# No error
kubectl exec -it eks-pod-identity -- aws s3 ls

# Error
kubectl exec -it eks-pod-identity -- aws ec2 describe-instances --region ap-northeast-2
An error occurred (UnauthorizedOperation) when calling the DescribeInstances operation: You are not authorized to perform this operation. User: arn:aws:sts::xx:assumed-role/s3-eks-pod-identity-role/eks-myeks-eks-pod-id-0382fb7d-1b2b-45c2-84bb-d5b123292589 is not authorized to perform: ec2:DescribeInstances because no identity-based policy allows the ec2:DescribeInstances action
command terminated with exit code 254
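
The two AWS_CONTAINER_* variables injected into the pod tell the AWS CLI and SDKs where to fetch credentials: they send the content of the token file as the Authorization header to the Pod Identity Agent endpoint. The sketch below makes that request by hand. It assumes curl is available in the container, and the function name is made up for this example.

```shell
# Request temporary credentials from the Pod Identity Agent,
# presenting the projected ServiceAccount token as the Authorization header.
pod_identity_credentials() {
  curl -s -H "Authorization: $(cat "$AWS_CONTAINER_AUTHORIZATION_TOKEN_FILE")" \
    "$AWS_CONTAINER_CREDENTIALS_FULL_URI"
}

# Usage (inside the eks-pod-identity pod, where both variables are set):
# pod_identity_credentials
```

The response contains real temporary credentials, so avoid printing it in shared terminals or logs.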

Let's look at the token here as well.

# Print the token
kubectl exec -it eks-pod-identity -- cat /var/run/secrets/pods.eks.amazonaws.com/serviceaccount/eks-pod-identity-token; echo

The aud (Audience) here is pods.eks.amazonaws.com rather than sts.amazonaws.com.

{
  "aud": [
    "pods.eks.amazonaws.com"
  ],
  "exp": 1742138781,
  "iat": 1742052381,
  "iss": "https://oidc.eks.ap-northeast-2.amazonaws.com/id/EF882B...",
  "jti": "50985ae4-392b-4caa-ae3c-c4de13f58e3c",
  "kubernetes.io": {
    "namespace": "default",
    "node": {
      "name": "ip-192-168-2-79.ap-northeast-2.compute.internal",
      "uid": "9e28f27b-c643-4c76-9e33-3658ca4014ed"
    },
    "pod": {
      "name": "eks-pod-identity",
      "uid": "276bf7da-851f-4eec-87d5-07488e972f2a"
    },
    "serviceaccount": {
      "name": "s3-sa",
      "uid": "7438f7de-ec70-435f-8d34-de7a239a955e"
    }
  },
  "nbf": 1742052381,
  "sub": "system:serviceaccount:default:s3-sa"
}

Clean up the resources now that the exercise is done.

eksctl delete podidentityassociation --cluster $CLUSTER_NAME --namespace default --service-account-name s3-sa
kubectl delete pod eks-pod-identity
kubectl delete sa s3-sa

6. Assigning Pod Permissions in AKS

In AKS, Workload Identity is the way to give a pod permission to access Azure resources.

It works by creating a principal called a Managed Identity in Azure, managing its permissions through Azure RBAC, and having a ServiceAccount use the Managed Identity.

Workload Identity is enabled on AKS with the --enable-oidc-issuer and --enable-workload-identity options.

az aks create --resource-group "${RESOURCE_GROUP}" --name "${CLUSTER_NAME}" --enable-oidc-issuer --enable-workload-identity --generate-ssh-keys

The document below describes how AKS Workload Identity works.

https://learn.microsoft.com/en-us/azure/aks/workload-identity-overview?tabs=dotnet

The document below explains how to set it up on an actual cluster.

https://learn.microsoft.com/en-us/azure/aks/workload-identity-deploy-cluster

The procedure is summarized below.

1) Enable Workload Identity and the OIDC issuer on the cluster

2) Create a Managed Identity

3) Create a Kubernetes ServiceAccount (annotated with the Managed Identity's Client ID)

4) Create Federated Identity Credentials

This step links the Managed Identity, the OIDC issuer, and the subject (ServiceAccount).

export FEDERATED_IDENTITY_CREDENTIAL_NAME="myFedIdentity$RANDOM_ID"
az identity federated-credential create --name ${FEDERATED_IDENTITY_CREDENTIAL_NAME} --identity-name "${USER_ASSIGNED_IDENTITY_NAME}" --resource-group "${RESOURCE_GROUP}" --issuer "${AKS_OIDC_ISSUER}" --subject system:serviceaccount:"${SERVICE_ACCOUNT_NAMESPACE}":"${SERVICE_ACCOUNT_NAME}" --audience api://AzureADTokenExchange

5) Assign Azure resource permissions to the Managed Identity (omitted)

6) Create the application (with the azure.workload.identity/use: "true" label and the ServiceAccount)

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
    name: sample-workload-identity-key-vault
    namespace: ${SERVICE_ACCOUNT_NAMESPACE}
    labels:
        azure.workload.identity/use: "true"
spec:
    serviceAccountName: ${SERVICE_ACCOUNT_NAME}
    containers:
      - image: ghcr.io/azure/azure-workload-identity/msal-go
        name: oidc
        env:
          - name: KEYVAULT_URL
            value: ${KEYVAULT_URL}
          - name: SECRET_NAME
            value: ${KEYVAULT_SECRET_NAME}
    nodeSelector:
        kubernetes.io/os: linux
EOF
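
Steps 2 and 3 in the list above have no commands, so here is a minimal sketch of them. It reuses the variable names from the other steps; the render_workload_identity_sa helper is an assumption for illustration, and the az commands are shown as comments because they need a real subscription.

```shell
# Render the ServiceAccount manifest for step 3, annotated with
# the Managed Identity's client ID.
render_workload_identity_sa() {
  local client_id=$1 name=$2 namespace=$3
  cat <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    azure.workload.identity/client-id: "$client_id"
  name: "$name"
  namespace: "$namespace"
EOF
}

# Usage (requires an Azure login and cluster access):
# az identity create --name "${USER_ASSIGNED_IDENTITY_NAME}" --resource-group "${RESOURCE_GROUP}"
# USER_ASSIGNED_CLIENT_ID=$(az identity show --name "${USER_ASSIGNED_IDENTITY_NAME}" --resource-group "${RESOURCE_GROUP}" --query clientId --output tsv)
# render_workload_identity_sa "$USER_ASSIGNED_CLIENT_ID" "${SERVICE_ACCOUNT_NAME}" "${SERVICE_ACCOUNT_NAMESPACE}" | kubectl apply -f -
```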

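This process does feel somewhat more involved than EKS with eksctl. My understanding is that EKS added new APIs to offer the feature simply, whereas AKS reuses existing Azure principals and handles it through existing Azure APIs.

Wrap-up

This post covered EKS authentication and authorization, and how IAM permissions are assigned to pods.

Along the way, it looked at how EKS ties into AWS IAM, the identity and access management service.

This can be understood as federating authentication between two different systems, and I hope the post answers two questions: how a principal valid in AWS can use Kubernetes authentication and authorization, and how a principal valid in Kubernetes can use AWS authentication and authorization.

In short, an AWS user is authenticated through IAM via the API server's Token Webhook Authentication, and the resulting ARN is checked against its mapping to a Kubernetes group (or Policy) so that Kubernetes RBAC can perform authorization.

A principal valid in Kubernetes works with AWS IAM through OIDC or the Pod Identity Agent, obtains a valid token after authentication, and can then access AWS resources.

The next post looks at using Fargate and Hybrid nodes on EKS.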


[5-2] EKS Autoscaling Part2

한명 2025. 3. 7. 01:24

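This post covers basic scaling techniques in a Kubernetes environment, then looks at the EKS autoscaling options and works through each one in hands-on exercises. Finally, it compares the AKS autoscaling options with EKS.

This is Part 2 of EKS Autoscaling. It picks up from the previous post, starting with Cluster Autoscaler.

1. CA (Cluster Autoscaler)

Let's look at CA (Cluster Autoscaler), which scales nodes.

Because many people already think of cloud autoscaling in terms of compute resources, it is often mistakenly assumed that CA is triggered by the resource utilization (such as CPU/Memory) of a virtual machine set (for example, ASG or VMSS).

However, the Kubernetes CA acts in the following situations.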

https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-is-cluster-autoscaler

Cluster Autoscaler increases the size of the cluster when:

there are pods that failed to schedule on any of the current nodes due to insufficient resources.
adding a node similar to the nodes currently present in the cluster would help.

Cluster Autoscaler decreases the size of the cluster when some nodes are consistently unneeded for a significant amount of time. A node is unneeded when it has low utilization and all of its important pods can be moved elsewhere.

In other words, CA increases the node count when pods cannot be scheduled because the current nodes lack resources. Nodes are therefore added when Pending pods, or more precisely unschedulable pods, appear.
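To make the "Pending versus unschedulable" distinction concrete, here is a minimal sketch that filters pods by the reason on their PodScheduled condition. The function name `unschedulable_pods` and the tab-separated input format are assumptions made for this illustration; the kubectl command in the comment is one way to produce that input on a live cluster.

```shell
# Reads "<pod-name><TAB><PodScheduled reason>" lines on stdin and prints the
# names of pods whose reason is "Unschedulable".
# (Hypothetical helper for illustration; it is not part of kubectl or CA.)
unschedulable_pods() {
  awk -F'\t' '$2 == "Unschedulable" { print $1 }'
}

# Example against a live cluster (not executed here):
#   kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="PodScheduled")].reason}{"\n"}{end}' \
#     | unschedulable_pods
```

A pod that is Pending for another reason, such as a slow image pull, has already been scheduled and would not appear in this list, which is why the Pending phase alone is only an approximation of what CA reacts to.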

The following illustrates HPA and CA on EKS.

Source: https://www.youtube.com/watch?v=jLuVZX6WQsw

On EKS, CA acts on nodes that carry the two tags below. You can check this in advance as follows.

# The EKS nodes already carry the tags below
# k8s.io/cluster-autoscaler/enabled : true
# k8s.io/cluster-autoscaler/myeks : owned
aws ec2 describe-instances  --filters Name=tag:Name,Values=$CLUSTER_NAME-ng1-Node --query "Reservations[*].Instances[*].Tags[*]" --output yaml
...
- Key: k8s.io/cluster-autoscaler/myeks
      Value: owned
- Key: k8s.io/cluster-autoscaler/enabled
      Value: 'true'
...

Set the ASG MaxSize to 6 in advance so that CA has room to scale out.

# Check the current Auto Scaling group (ASG) settings
aws autoscaling describe-auto-scaling-groups \
    --query "AutoScalingGroups[? Tags[? (Key=='eks:cluster-name') && Value=='myeks']].[AutoScalingGroupName, MinSize, MaxSize,DesiredCapacity]" \
    --output table
-----------------------------------------------------------------
|                   DescribeAutoScalingGroups                   |
+------------------------------------------------+----+----+----+
|  eks-ng1-70cab5c8-890d-c414-cc6d-c0d2eac06322  |  3 |  3 |  3 |
+------------------------------------------------+----+----+----+

# Change MaxSize to 6
export ASG_NAME=$(aws autoscaling describe-auto-scaling-groups --query "AutoScalingGroups[? Tags[? (Key=='eks:cluster-name') && Value=='myeks']].AutoScalingGroupName" --output text)
aws autoscaling update-auto-scaling-group --auto-scaling-group-name ${ASG_NAME} --min-size 3 --desired-capacity 3 --max-size 6

# Verify
aws autoscaling describe-auto-scaling-groups --query "AutoScalingGroups[? Tags[? (Key=='eks:cluster-name') && Value=='myeks']].[AutoScalingGroupName, MinSize, MaxSize,DesiredCapacity]" --output table
-----------------------------------------------------------------
|                   DescribeAutoScalingGroups                   |
+------------------------------------------------+----+----+----+
|  eks-ng1-70cab5c8-890d-c414-cc6d-c0d2eac06322  |  3 |  6 |  3 |
+------------------------------------------------+----+----+----+

Now install CA on the cluster.

# Deploy the Cluster Autoscaler (CAS)
curl -s -O https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml
...
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false # whether scale-down skips nodes whose pods use local storage; false means such nodes can be scaled down
            - --expander=least-waste # which node group to pick when scaling up; least-waste picks the group that leaves the least idle resources after the scale-up
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<YOUR CLUSTER NAME>
...

sed -i -e "s|<YOUR CLUSTER NAME>|$CLUSTER_NAME|g" cluster-autoscaler-autodiscover.yaml
kubectl apply -f cluster-autoscaler-autodiscover.yaml

You can confirm that the cluster-autoscaler pod (Deployment) is running on a node.

# Verify
kubectl get pod -n kube-system | grep cluster-autoscaler
cluster-autoscaler-6df6d76b9f-ss5gd           1/1     Running   0          11s

# Check the asg:tag values used by node-group-auto-discovery
kubectl describe deployments.apps -n kube-system cluster-autoscaler | grep node-group-auto-discovery
      --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/myeks

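# (Optional) Keep the cluster-autoscaler pod from being evicted, so that the worker node it runs on is not scaled down
# Note: CA reads this annotation from Pods; if it has no effect when set on the Deployment object, set it in the Pod template (spec.template.metadata.annotations) instead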
kubectl -n kube-system annotate deployment.apps/cluster-autoscaler cluster-autoscaler.kubernetes.io/safe-to-evict="false"

Use the example below to observe CA in action.

# Monitor nodes
while true; do date; kubectl get node; echo "------------------------------" ; sleep 5; done

# Deploy a Sample App
# We will deploy a sample nginx application as a Deployment with 1 replica
cat << EOF > nginx.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-to-scaleout
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        service: nginx
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx-to-scaleout
        resources:
          limits:
            cpu: 500m
            memory: 512Mi
          requests:
            cpu: 500m
            memory: 512Mi
EOF
kubectl apply -f nginx.yaml
kubectl get deployment/nginx-to-scaleout

# Scale the Deployment
# Let's scale the Deployment out to 15 replicas
kubectl scale --replicas=15 deployment/nginx-to-scaleout && date

deployment.apps/nginx-to-scaleout scaled
Thu Mar  6 23:48:09 KST 2025

# Verify
kubectl get po |grep Pending
nginx-to-scaleout-7cfb655fb5-4vtb9   0/1     Pending   0          20s
nginx-to-scaleout-7cfb655fb5-6z6lk   0/1     Pending   0          20s
nginx-to-scaleout-7cfb655fb5-9g7s6   0/1     Pending   0          20s
nginx-to-scaleout-7cfb655fb5-ckph6   0/1     Pending   0          20s
nginx-to-scaleout-7cfb655fb5-lqbhc   0/1     Pending   0          20s
nginx-to-scaleout-7cfb655fb5-vk5bb   0/1     Pending   0          20s
nginx-to-scaleout-7cfb655fb5-vwnv7   0/1     Pending   0          20s

# Confirm that nodes were added automatically
kubectl get nodes
aws autoscaling describe-auto-scaling-groups \
    --query "AutoScalingGroups[? Tags[? (Key=='eks:cluster-name') && Value=='myeks']].[AutoScalingGroupName, MinSize, MaxSize,DesiredCapacity]" \
    --output table
-----------------------------------------------------------------
|                   DescribeAutoScalingGroups                   |
+------------------------------------------------+----+----+----+
|  eks-ng1-70cab5c8-890d-c414-cc6d-c0d2eac06322  |  3 |  6 |  6 |
+------------------------------------------------+----+----+----+

# [Operations server EC2] Check CreateFleet API calls from the last hour
# https://ap-northeast-2.console.aws.amazon.com/cloudtrailv2/home?region=ap-northeast-2#/events?EventName=CreateFleet
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=CreateFleet \
  --start-time "$(date -d '1 hour ago' --utc +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date --utc +%Y-%m-%dT%H:%M:%SZ)"

{
    "Events": [
        {
            "EventId": "d16d3ea9-58ef-4d1e-8776-6172f2ea0d4a",
            "EventName": "CreateFleet",
            "ReadOnly": "false",
            "EventTime": "2025-03-06T23:48:25+09:00",
            "EventSource": "ec2.amazonaws.com",
            "Username": "AutoScaling",
            "Resources": [],
         ...

# (Reference) Event name: UpdateAutoScalingGroup
# https://ap-northeast-2.console.aws.amazon.com/cloudtrailv2/home?region=ap-northeast-2#/events?EventName=UpdateAutoScalingGroup

On EKS, I measured the time from when Pending pods appeared to when nodes were created, then ran a similar test on AKS and compared the two.

EKS used t3.medium (2 vCPU, 4GB), and AKS used Standard_B2s (2 vCPU, 4GB), which is also a Burstable type.

EKS CA test

# Scale out the application
kubectl scale --replicas=15 deployment/nginx-to-scaleout && date
deployment.apps/nginx-to-scaleout scaled
Thu Mar  6 23:48:09 KST 2025

# Before node creation
------------------------------
Thu Mar  6 23:49:03 KST 2025
NAME                                               STATUS   ROLES    AGE    VERSION
ip-192-168-1-87.ap-northeast-2.compute.internal    Ready    <none>   100m   v1.31.5-eks-5d632ec
ip-192-168-2-195.ap-northeast-2.compute.internal   Ready    <none>   100m   v1.31.5-eks-5d632ec
ip-192-168-3-136.ap-northeast-2.compute.internal   Ready    <none>   100m   v1.31.5-eks-5d632ec
...
# Nodes added -> took roughly 1 min 10 s
------------------------------
Thu Mar  6 23:49:17 KST 2025
NAME                                               STATUS     ROLES    AGE    VERSION
ip-192-168-1-67.ap-northeast-2.compute.internal    NotReady   <none>   5s     v1.31.5-eks-5d632ec
ip-192-168-1-87.ap-northeast-2.compute.internal    Ready      <none>   100m   v1.31.5-eks-5d632ec
ip-192-168-2-195.ap-northeast-2.compute.internal   Ready      <none>   100m   v1.31.5-eks-5d632ec
ip-192-168-2-246.ap-northeast-2.compute.internal   NotReady   <none>   10s    v1.31.5-eks-5d632ec
ip-192-168-3-136.ap-northeast-2.compute.internal   Ready      <none>   100m   v1.31.5-eks-5d632ec
ip-192-168-3-229.ap-northeast-2.compute.internal   NotReady   <none>   4s     v1.31.5-eks-5d632ec
...
# All nodes Ready -> took roughly 1 min 30 s
------------------------------
Thu Mar  6 23:49:32 KST 2025
NAME                                               STATUS   ROLES    AGE    VERSION
ip-192-168-1-67.ap-northeast-2.compute.internal    Ready    <none>   19s    v1.31.5-eks-5d632ec
ip-192-168-1-87.ap-northeast-2.compute.internal    Ready    <none>   100m   v1.31.5-eks-5d632ec
ip-192-168-2-195.ap-northeast-2.compute.internal   Ready    <none>   100m   v1.31.5-eks-5d632ec
ip-192-168-2-246.ap-northeast-2.compute.internal   Ready    <none>   24s    v1.31.5-eks-5d632ec
ip-192-168-3-136.ap-northeast-2.compute.internal   Ready    <none>   100m   v1.31.5-eks-5d632ec
ip-192-168-3-229.ap-northeast-2.compute.internal   Ready    <none>   18s    v1.31.5-eks-5d632ec

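AKS CA test

AKS was likewise configured to autoscale between 3 and 6 nodes.

Judging by the results and timings, there does not seem to be a meaningful difference. On both EKS and AKS, the nodes were added in about 1 minute 30 seconds. This test only measured approximate times, so please treat it as a rough reference.

# Scale out the application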
$ kubectl scale --replicas=15 deployment/nginx-to-scaleout && date
deployment.apps/nginx-to-scaleout scaled
Thu Mar  6 15:08:17 UTC 2025

# Before node creation
------------------------------
Thu Mar  6 15:09:41 UTC 2025
aks-userpool-13024277-vmss000000    Ready    <none>   9m56s   v1.31.4
aks-userpool-13024277-vmss000001    Ready    <none>   9m51s   v1.31.4
aks-userpool-13024277-vmss000002    Ready    <none>   9m57s   v1.31.4
# Nodes added -> took roughly 1 min 30 s
------------------------------
Thu Mar  6 15:09:46 UTC 2025
aks-userpool-13024277-vmss000000    Ready      <none>   10m     v1.31.4
aks-userpool-13024277-vmss000001    Ready      <none>   9m57s   v1.31.4
aks-userpool-13024277-vmss000002    Ready      <none>   10m     v1.31.4
aks-userpool-13024277-vmss000003    NotReady   <none>   1s      v1.31.4
aks-userpool-13024277-vmss000004    Ready      <none>   2s      v1.31.4
aks-userpool-13024277-vmss000005    Ready      <none>   1s      v1.31.4
# All nodes Ready -> took roughly 1 min 35 s
------------------------------
Thu Mar  6 15:09:52 UTC 2025
aks-userpool-13024277-vmss000000    Ready    <none>   10m   v1.31.4
aks-userpool-13024277-vmss000001    Ready    <none>   10m   v1.31.4
aks-userpool-13024277-vmss000002    Ready    <none>   10m   v1.31.4
aks-userpool-13024277-vmss000003    Ready    <none>   6s    v1.31.4
aks-userpool-13024277-vmss000004    Ready    <none>   7s    v1.31.4
aks-userpool-13024277-vmss000005    Ready    <none>   6s    v1.31.4
------------------------------
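The rough timings in the comments above can be recomputed from the printed timestamps. The sketch below is a small helper for that arithmetic; the function names `to_seconds` and `elapsed_seconds` are made up for this example, and it assumes both timestamps fall on the same day. Because the monitoring loop only polls every 5 seconds, the results are accurate to within a few seconds at best.

```shell
# Convert HH:MM:SS to seconds since midnight (awk treats "08"/"09" as decimal).
to_seconds() {
  echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Print the number of seconds between two same-day timestamps.
elapsed_seconds() {
  start=$(to_seconds "$1")
  end=$(to_seconds "$2")
  echo $(( end - start ))
}

# EKS: scale command at 23:48:09
elapsed_seconds 23:48:09 23:49:17   # new nodes first listed -> 68
elapsed_seconds 23:48:09 23:49:32   # all nodes Ready        -> 83

# AKS: scale command at 15:08:17
elapsed_seconds 15:08:17 15:09:46   # new nodes first listed -> 89
elapsed_seconds 15:08:17 15:09:52   # all nodes Ready        -> 95
```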

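Delete all resources before moving on to the next exercise.

# After deleting the Deployment, wait about 10 minutes and confirm that the node count shrinks before running the deletions below. If you delete CA right away, the added worker nodes stay in place and the size has to be reduced manually.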
kubectl delete -f nginx.yaml

# Reset the ASG size
aws autoscaling update-auto-scaling-group --auto-scaling-group-name ${ASG_NAME} --min-size 3 --desired-capacity 3 --max-size 3
aws autoscaling describe-auto-scaling-groups --query "AutoScalingGroups[? Tags[? (Key=='eks:cluster-name') && Value=='myeks']].[AutoScalingGroupName, MinSize, MaxSize,DesiredCapacity]" --output table

# Delete Cluster Autoscaler
kubectl delete -f cluster-autoscaler-autodiscover.yaml

The Karpenter exercise follows the official guide and uses a new cluster, so once this exercise is finished, also delete the lab environment created so far, as shown below.

# eksctl delete cluster --name $CLUSTER_NAME && aws cloudformation delete-stack --stack-name $CLUSTER_NAME
nohup sh -c "eksctl delete cluster --name $CLUSTER_NAME && aws cloudformation delete-stack --stack-name $CLUSTER_NAME" > /root/delete.log 2>&1 &

# (Optional) Follow the deletion progress
tail -f /root/delete.log

For more on CA, see the AWS Workshop documentation.

https://catalog.us-east-1.prod.workshops.aws/workshops/9c0aa9ab-90a9-44a6-abe1-8dff360ae428/ko-KR/100-scaling/200-cluster-scaling

2. Karpenter

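So far we have looked at CA and walked through how it works. CA is an option that scales nodes through the virtual machine sets provided by the CSP (e.g., ASG, VMSS).

However, because CA scales on the basis of the user's node groups, it has the following limitations.

First, things get complicated when many node groups are created for different requirements, and scaling happens from the node perspective rather than from the perspective of pod capacity (requests). Also, because CA controls EC2 instances through an Auto Scaling Group internally, some delay is to be expected.

Source: https://www.youtube.com/watch?v=jLuVZX6WQsw

Karpenter was introduced to overcome this complexity and delay. Developed by AWS, Karpenter is now open source and can be used with other CSPs as well.

Karpenter is a high-performance, intelligent Kubernetes scaling tool. Unlike CA, Karpenter selects a suitable node size based on the capacity of the Pending pods.

Source: https://www.youtube.com/watch?v=yMOaOlPvrgY&t=717s

It also requests instance creation through the EC2 Fleet API and watches for Pending pods through the Watch API.

Source: https://www.youtube.com/watch?v=jLuVZX6WQsw

In summary, CA and Karpenter differ as follows.

CA checks for Pending (unschedulable) pod events once every 10 seconds, whereas Karpenter detects them immediately through a Watch.
CA goes CA -> ASG -> EC2 Fleet API, passing through the additional ASG step, whereas Karpenter does not depend on ASG and calls the EC2 Fleet API directly, which makes it faster. (If Pending pods occur across several node groups, CA handles them sequentially, so it can be slower still.)
CA adds nodes at the capacity defined for the node group rather than in proportion to the capacity of the Pending pods, so it is hard to say that nodes are created at the right size. Karpenter, by contrast, can evaluate Pending pods as a batch and decides on an instance size that fits their capacity.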
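The idea of sizing a node to a batch of Pending pods can be sketched in a few lines. This is a deliberately simplified illustration, not Karpenter's actual algorithm: it looks only at CPU, whereas Karpenter also weighs memory, price, NodePool requirements, and more. The function `pick_size` and the `CANDIDATES` table are hypothetical. Only the 7910m figure for a 2xlarge node appears in the logs later in this post; the other allocatable values are illustrative assumptions.

```shell
# Hypothetical candidates: "<size>:<allocatable CPU in millicores>", smallest first.
# Only 2xlarge:7910 is taken from this post's logs; the rest are illustrative.
CANDIDATES="large:1930 xlarge:3920 2xlarge:7910 4xlarge:15890"

# Sum the CPU requests (millicores) passed as arguments and print the smallest
# candidate size whose allocatable CPU covers the total.
pick_size() {
  need=0
  for req in "$@"; do
    need=$(( need + req ))
  done
  for c in $CANDIDATES; do
    size=${c%%:*}    # text before the colon
    alloc=${c##*:}   # text after the colon
    if [ "$alloc" -ge "$need" ]; then
      echo "$size"
      return 0
    fi
  done
  echo "none"
  return 1
}

# Five pods requesting 1 CPU each, plus an assumed 150m of per-node overhead
# (5150m in total, the same total Karpenter logs later in this post).
pick_size 1000 1000 1000 1000 1000 150   # prints 2xlarge
```

CA, by contrast, would add nodes of whatever instance type the node group defines, regardless of the batch total.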

Karpenter's operation can be understood as follows. In summary, it consists of detect (watch) -> evaluate -> Fleet request.

Source: https://www.youtube.com/watch?v=jLuVZX6WQsw

Create a new EKS cluster for the exercise.

# Set variables
export KARPENTER_NAMESPACE="kube-system"
export KARPENTER_VERSION="1.2.1"
export K8S_VERSION="1.32"
export AWS_PARTITION="aws" 
export CLUSTER_NAME="karpenter-demo" # ${USER}-karpenter-demo
export AWS_DEFAULT_REGION="ap-northeast-2"
export AWS_ACCOUNT_ID="$(aws sts get-caller-identity --query Account --output text)"
export TEMPOUT="$(mktemp)"
export ALIAS_VERSION="$(aws ssm get-parameter --name "/aws/service/eks/optimized-ami/${K8S_VERSION}/amazon-linux-2023/x86_64/standard/recommended/image_id" --query Parameter.Value | xargs aws ec2 describe-images --query 'Images[0].Name' --image-ids | sed -r 's/^.*(v[[:digit:]]+).*$/\1/')"

# Verify
echo "${KARPENTER_NAMESPACE}" "${KARPENTER_VERSION}" "${K8S_VERSION}" "${CLUSTER_NAME}" "${AWS_DEFAULT_REGION}" "${AWS_ACCOUNT_ID}" "${TEMPOUT}" "${ALIAS_VERSION}"

# Create the IAM Policy/Role, SQS, and Event/Rule with a CloudFormation stack: takes about 3 minutes
## IAM Policy : KarpenterControllerPolicy-${CLUSTER_NAME}
## IAM Role : KarpenterNodeRole-${CLUSTER_NAME}
curl -fsSL https://raw.githubusercontent.com/aws/karpenter-provider-aws/v"${KARPENTER_VERSION}"/website/content/en/preview/getting-started/getting-started-with-karpenter/cloudformation.yaml  > "${TEMPOUT}" \
&& aws cloudformation deploy \
  --stack-name "Karpenter-${CLUSTER_NAME}" \
  --template-file "${TEMPOUT}" \
  --capabilities CAPABILITY_NAMED_IAM \
  --parameter-overrides "ClusterName=${CLUSTER_NAME}"


# Create the cluster: EKS cluster creation takes about 15 minutes
eksctl create cluster -f - <<EOF
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: ${CLUSTER_NAME}
  region: ${AWS_DEFAULT_REGION}
  version: "${K8S_VERSION}"
  tags:
    karpenter.sh/discovery: ${CLUSTER_NAME}

iam:
  withOIDC: true
  podIdentityAssociations:
  - namespace: "${KARPENTER_NAMESPACE}"
    serviceAccountName: karpenter
    roleName: ${CLUSTER_NAME}-karpenter
    permissionPolicyARNs:
    - arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:policy/KarpenterControllerPolicy-${CLUSTER_NAME}

iamIdentityMappings:
- arn: "arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:role/KarpenterNodeRole-${CLUSTER_NAME}"
  username: system:node:{{EC2PrivateDNSName}}
  groups:
  - system:bootstrappers
  - system:nodes
  ## If you intend to run Windows workloads, the kube-proxy group should be specified.
  # For more information, see https://github.com/aws/karpenter/issues/5099.
  # - eks:kube-proxy-windows

managedNodeGroups:
- instanceType: m5.large
  amiFamily: AmazonLinux2023
  name: ${CLUSTER_NAME}-ng
  desiredCapacity: 2
  minSize: 1
  maxSize: 10
  iam:
    withAddonPolicies:
      externalDNS: true

addons:
- name: eks-pod-identity-agent
EOF


# Verify the EKS deployment
eksctl get cluster
NAME            REGION          EKSCTL CREATED
karpenter-demo  ap-northeast-2  True

eksctl get nodegroup --cluster $CLUSTER_NAME
CLUSTER         NODEGROUP               STATUS  CREATED                 MIN SIZE        MAX SIZE        DESIRED CAPACITY        INSTANCE TYPE   IMAGE ID                ASG NAME                                                   TYPE
karpenter-demo  karpenter-demo-ng       ACTIVE  2025-03-06T15:38:46Z    1               10              2                       m5.large        AL2023_x86_64_STANDARD  eks-karpenter-demo-ng-96cab60d-f4b7-28dd-a83d-8366de887a29 managed


# Verify Kubernetes
kubectl get node --label-columns=node.kubernetes.io/instance-type,eks.amazonaws.com/capacityType,topology.kubernetes.io/zone
NAME                                                STATUS   ROLES    AGE     VERSION               INSTANCE-TYPE   CAPACITYTYPE   ZONE
ip-192-168-33-227.ap-northeast-2.compute.internal   Ready    <none>   5m25s   v1.32.1-eks-5d632ec   m5.large        ON_DEMAND      ap-northeast-2a
ip-192-168-91-227.ap-northeast-2.compute.internal   Ready    <none>   5m25s   v1.32.1-eks-5d632ec   m5.large        ON_DEMAND      ap-northeast-2b

kubectl get po -A
NAMESPACE     NAME                              READY   STATUS    RESTARTS   AGE
kube-system   aws-node-9nppw                    2/2     Running   0          5m30s
kube-system   aws-node-x9ffn                    2/2     Running   0          5m30s
kube-system   coredns-844d8f59bb-j9jf9          1/1     Running   0          9m33s
kube-system   coredns-844d8f59bb-pqgpf          1/1     Running   0          9m33s
kube-system   eks-pod-identity-agent-bnshb      1/1     Running   0          5m30s
kube-system   eks-pod-identity-agent-f49wd      1/1     Running   0          5m30s
kube-system   kube-proxy-qqtss                  1/1     Running   0          5m29s
kube-system   kube-proxy-vk86h                  1/1     Running   0          5m30s
kube-system   metrics-server-74b6cb4f8f-dg8qk   1/1     Running   0          9m35s
kube-system   metrics-server-74b6cb4f8f-rkrhr   1/1     Running   0          9m35s

Install kube-ops-view as well, to watch node creation during the exercise.

# kube-ops-view
helm repo add geek-cookbook https://geek-cookbook.github.io/charts/
helm install kube-ops-view geek-cookbook/kube-ops-view --version 1.2.2 --set service.main.type=LoadBalancer --set env.TZ="Asia/Seoul" --namespace kube-system

# Access URL
echo -e "http://$(kubectl get svc -n kube-system kube-ops-view -o jsonpath="{.status.loadBalancer.ingress[0].hostname}"):8080/#scale=1.5"

Now install Karpenter.

# Logout of helm registry to perform an unauthenticated pull against the public ECR
helm registry logout public.ecr.aws

# Set and check the variables for the Karpenter installation
export CLUSTER_ENDPOINT="$(aws eks describe-cluster --name "${CLUSTER_NAME}" --query "cluster.endpoint" --output text)"
export KARPENTER_IAM_ROLE_ARN="arn:${AWS_PARTITION}:iam::${AWS_ACCOUNT_ID}:role/${CLUSTER_NAME}-karpenter"
echo "${CLUSTER_ENDPOINT} ${KARPENTER_IAM_ROLE_ARN}"

# Install Karpenter
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter --version "${KARPENTER_VERSION}" --namespace "${KARPENTER_NAMESPACE}" --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait

# Verify
helm list -n kube-system
NAME            NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                   APP VERSION
karpenter       kube-system     1               2025-03-07 00:48:49.238176978 +0900 KST deployed        karpenter-1.2.1         1.2.1
kube-ops-view   kube-system     1               2025-03-07 00:47:14.936078967 +0900 KST deployed        kube-ops-view-1.2.2     20.4.0

kubectl get pod -n $KARPENTER_NAMESPACE |grep karpenter
karpenter-5bdb74ddd6-kx7bq        1/1     Running   0          113s
karpenter-5bdb74ddd6-qpzvh        1/1     Running   0          113s

kubectl get crd | grep karpenter
ec2nodeclasses.karpenter.k8s.aws             2025-03-06T15:48:48Z
nodeclaims.karpenter.sh                      2025-03-06T15:48:48Z
nodepools.karpenter.sh                       2025-03-06T15:48:48Z

Create the NodePool and EC2NodeClass.

# Check the variable
echo $ALIAS_VERSION
v20250228

# Create the NodePool and EC2NodeClass
cat <<EOF | envsubst | kubectl apply -f -
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["2"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      expireAfter: 720h # 30 * 24h = 720h
  limits:
    cpu: 1000
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  role: "KarpenterNodeRole-${CLUSTER_NAME}" # replace with your cluster name
  amiSelectorTerms:
    - alias: "al2023@${ALIAS_VERSION}" # ex) al2023@latest
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}" # replace with your cluster name
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}" # replace with your cluster name
EOF

# Verify (there is no nodeclaim yet)
kubectl get nodepool,ec2nodeclass,nodeclaims
NAME                            NODECLASS   NODES   READY   AGE
nodepool.karpenter.sh/default   default     0       True    12s

NAME                                     READY   AGE
ec2nodeclass.karpenter.k8s.aws/default   True    12s

Here, NodePool and NodeClass have the following meanings.

NodePool: defines the composition and behavior of a group of nodes (the selection criteria/boundaries for nodes). For example, instance types, capacity types, requirements on the worker node spec, scaling policy, and node lifecycle management -> defines what kind of node is needed
NodeClass: the concrete settings of the EC2 instance. For example, node image, subnets, security group settings, IAM role, and tags -> defines how the node is created in AWS

Karpenter creates and manages nodes through an object called NodeClaim. Karpenter monitors NodePool and NodeClass, and when a new pod's requirements do not match the resources or conditions of the existing nodes, it creates a NodeClaim and provisions a new node with suitable specs. As a result, each node that Karpenter provisions maps 1:1 to its own NodeClaim.

This flow is shown in the figure below.

Source: https://karpenter.sh/docs/concepts/nodeclaims/

Node creation proceeds through the following steps.

Reference: https://repost.aws/ko/articles/ARLmKuAa3FT9yMjdpq9krOTg/karpenter-%EB%A1%9C%EA%B7%B8%EB%A5%BC-%ED%86%B5%ED%95%B4-eks-worker-node%EC%9D%98-lifecycle-event%EB%A5%BC-%EC%B6%94%EC%B6%9C%ED%95%98%EA%B3%A0-%ED%95%9C%EB%88%88%EC%97%90-%ED%8C%8C%EC%95%85%ED%95%98%EB%8A%94-%EB%B0%A9%EC%95%88

Create NodeClaim: Karpenter creates a new NodeClaim in response to a provisioning or disrupting need.
Launch NodeClaim: Karpenter calls the CreateFleet API to create a new EC2 instance in AWS.
Register NodeClaim: Once the EC2 instance has been created and has joined the cluster, the resulting Node is linked to the NodeClaim.
Initialize NodeClaim: Karpenter waits until the Node becomes Ready.

After each step completes, Karpenter records the details in its log. Below is an example of such a log entry.

## Create NodeClaim
{"level":"INFO","time":"2024-12-31T09:50:28.720Z","logger":"controller","message":"created nodeclaim","commit":"0a85efb","controller":"provisioner","namespace":"","name":"","reconcileID":"63c2695c-4c54-4a9b-9b64-1804d9ddbb82","NodePool":{"name":"default"},"NodeClaim":{"name":"default-abcde"},"requests":{"cpu":"1516m","memory":"1187Mi","pods":"17"},"instance-types":"c4.large, c4.xlarge, c5.large, c5.xlarge, c5a.2xlarge and 55 other(s)"}
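Since each lifecycle step leaves one JSON log line, the four steps can be pulled out of the controller log with ordinary text tools. The sketch below is one way to do it with grep and sed only. The function name `nodeclaim_events` is an assumption for this example, and it relies on the "time" field appearing before "message" in each line, as it does in the log lines shown in this post.

```shell
# Reads Karpenter controller JSON log lines on stdin and prints
# "<time> <message>" for the four NodeClaim lifecycle messages.
# (Hypothetical helper; assumes "time" precedes "message" within each line.)
nodeclaim_events() {
  grep -E '"message":"(created|launched|registered|initialized) nodeclaim"' |
    sed -E 's/.*"time":"([^"]*)".*"message":"([^"]*)".*/\1 \2/'
}

# Example against a live cluster (not executed here). With a label selector,
# kubectl logs returns only the last 10 lines per pod by default, so --tail=-1
# is passed to read the whole log:
#   kubectl logs -n "${KARPENTER_NAMESPACE}" -l app.kubernetes.io/name=karpenter \
#     -c controller --tail=-1 | nodeclaim_events
```

Reading the four timestamps side by side shows how long the node spent between Launch, Register, and Initialize.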

Now deploy a sample application for testing.

# Deploy a Deployment in which each pause pod requests a guaranteed minimum of 1 CPU (replicas: 0 for now)
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 0
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      terminationGracePeriodSeconds: 0
      securityContext:
        runAsUser: 1000
        runAsGroup: 3000
        fsGroup: 2000
      containers:
      - name: inflate
        image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
        resources:
          requests:
            cpu: 1
        securityContext:
          allowPrivilegeEscalation: false
EOF

# Scale up
kubectl scale deployment inflate --replicas 5; date
deployment.apps/inflate scaled
Fri Mar  7 00:56:59 KST 2025

Let's look more closely at how Karpenter creates the node.

# Check the karpenter pod logs
kubectl logs -f -n "${KARPENTER_NAMESPACE}" -l app.kubernetes.io/name=karpenter -c controller
...
{"level":"INFO","time":"2025-03-06T15:57:00.326Z","logger":"controller","message":"found provisionable pod(s)","commit":"058c665","controller":"provisioner","namespace":"","name":"","reconcileID":"529ce301-0064-436f-9275-6020da23c7b5","Pods":"default/inflate-5c5f75666d-gbgst, default/inflate-5c5f75666d-p6zt9, default/inflate-5c5f75666d-85csz, default/inflate-5c5f75666d-fjkhh, default/inflate-5c5f75666d-pncp9","duration":"74.997844ms"}
# Computes the nodeclaim(s) that fit the pods
{"level":"INFO","time":"2025-03-06T15:57:00.326Z","logger":"controller","message":"computed new nodeclaim(s) to fit pod(s)","commit":"058c665","controller":"provisioner","namespace":"","name":"","reconcileID":"529ce301-0064-436f-9275-6020da23c7b5","nodeclaims":1,"pods":5}
# Creates the nodeclaim
{"level":"INFO","time":"2025-03-06T15:57:00.344Z","logger":"controller","message":"created nodeclaim","commit":"058c665","controller":"provisioner","namespace":"","name":"","reconcileID":"529ce301-0064-436f-9275-6020da23c7b5","NodePool":{"name":"default"},"NodeClaim":{"name":"default-n4xc5"},"requests":{"cpu":"5150m","pods":"8"},"instance-types":"c4.2xlarge, c4.4xlarge, c5.2xlarge, c5.4xlarge, c5a.2xlarge and 55 other(s)"}
# Launches the nodeclaim
{"level":"INFO","time":"2025-03-06T15:57:02.422Z","logger":"controller","message":"launched nodeclaim","commit":"058c665","controller":"nodeclaim.lifecycle","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"default-n4xc5"},"namespace":"","name":"default-n4xc5","reconcileID":"d273db5c-0284-4fd1-9246-6d68fcb0c06b","provider-id":"aws:///ap-northeast-2a/i-0068e4889e1e71961","instance-type":"c5a.2xlarge","zone":"ap-northeast-2a","capacity-type":"on-demand","allocatable":{"cpu":"7910m","ephemeral-storage":"17Gi","memory":"14162Mi","pods":"58","vpc.amazonaws.com/pod-eni":"38"}}
# Registers the nodeclaim
{"level":"INFO","time":"2025-03-06T15:57:21.500Z","logger":"controller","message":"registered nodeclaim","commit":"058c665","controller":"nodeclaim.lifecycle","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"default-n4xc5"},"namespace":"","name":"default-n4xc5","reconcileID":"e49f377e-7e6c-4969-8231-b3b2657bd624","provider-id":"aws:///ap-northeast-2a/i-0068e4889e1e71961","Node":{"name":"ip-192-168-149-58.ap-northeast-2.compute.internal"}}
# Initializes the nodeclaim
{"level":"INFO","time":"2025-03-06T15:57:31.030Z","logger":"controller","message":"initialized nodeclaim","commit":"058c665","controller":"nodeclaim.lifecycle","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"default-n4xc5"},"namespace":"","name":"default-n4xc5","reconcileID":"e3481aac-a971-49ab-b670-bd8c788faff7","provider-id":"aws:///ap-northeast-2a/i-0068e4889e1e71961","Node":{"name":"ip-192-168-149-58.ap-northeast-2.compute.internal"},"allocatable":{"cpu":"7910m","ephemeral-storage":"18181869946","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"15140112Ki","pods":"58"}}
..

# The logs can also be viewed as formatted JSON
kubectl logs -f -n "${KARPENTER_NAMESPACE}" -l app.kubernetes.io/name=karpenter -c controller | jq '.'

kubectl logs -n "${KARPENTER_NAMESPACE}" -l app.kubernetes.io/name=karpenter -c controller | grep 'launched nodeclaim' | jq '.'
{
  "level": "INFO",
  "time": "2025-03-06T15:57:02.422Z",
  "logger": "controller",
  "message": "launched nodeclaim",
  "commit": "058c665",
  "controller": "nodeclaim.lifecycle",
  "controllerGroup": "karpenter.sh",
  "controllerKind": "NodeClaim",
  "NodeClaim": {
    "name": "default-n4xc5"
  },
  "namespace": "",
  "name": "default-n4xc5",
  "reconcileID": "d273db5c-0284-4fd1-9246-6d68fcb0c06b",
  "provider-id": "aws:///ap-northeast-2a/i-0068e4889e1e71961",
  "instance-type": "c5a.2xlarge",
  "zone": "ap-northeast-2a",
  "capacity-type": "on-demand",
  "allocatable": {
    "cpu": "7910m",
    "ephemeral-storage": "17Gi",
    "memory": "14162Mi",
    "pods": "58",
    "vpc.amazonaws.com/pod-eni": "38"
  }
}

# Monitor nodes
kubectl scale deployment inflate --replicas 5; date
deployment.apps/inflate scaled
Fri Mar  7 00:56:59 KST 2025

while true; do date; kubectl get node; echo "------------------------------" ; sleep 5; done
...
# Before node creation
------------------------------
Fri Mar  7 00:57:16 KST 2025
NAME                                                STATUS   ROLES    AGE   VERSION
ip-192-168-33-227.ap-northeast-2.compute.internal   Ready    <none>   17m   v1.32.1-eks-5d632ec
ip-192-168-91-227.ap-northeast-2.compute.internal   Ready    <none>   17m   v1.32.1-eks-5d632ec
------------------------------
# Node added: 24 s
Fri Mar  7 00:57:23 KST 2025
NAME                                                STATUS     ROLES    AGE   VERSION
ip-192-168-149-58.ap-northeast-2.compute.internal   NotReady   <none>   4s    v1.32.1-eks-5d632ec
ip-192-168-33-227.ap-northeast-2.compute.internal   Ready      <none>   17m   v1.32.1-eks-5d632ec
ip-192-168-91-227.ap-northeast-2.compute.internal   Ready      <none>   17m   v1.32.1-eks-5d632ec
------------------------------
# Node Ready: 31 s
Fri Mar  7 00:57:30 KST 2025
NAME                                                STATUS   ROLES    AGE   VERSION
ip-192-168-149-58.ap-northeast-2.compute.internal   Ready    <none>   11s   v1.32.1-eks-5d632ec
ip-192-168-33-227.ap-northeast-2.compute.internal   Ready    <none>   17m   v1.32.1-eks-5d632ec
ip-192-168-91-227.ap-northeast-2.compute.internal   Ready    <none>   17m   v1.32.1-eks-5d632ec


# A NodeClaim is created.
kubectl get nodeclaims -w
NAME            TYPE   CAPACITY   ZONE   NODE   READY   AGE
default-n4xc5                                           0s
default-n4xc5                                           0s
default-n4xc5                                   Unknown   0s
default-n4xc5   c5a.2xlarge   on-demand   ap-northeast-2a          Unknown   2s
default-n4xc5   c5a.2xlarge   on-demand   ap-northeast-2a          Unknown   2s
default-n4xc5   c5a.2xlarge   on-demand   ap-northeast-2a   ip-192-168-149-58.ap-northeast-2.compute.internal   Unknown   21s
default-n4xc5   c5a.2xlarge   on-demand   ap-northeast-2a   ip-192-168-149-58.ap-northeast-2.compute.internal   Unknown   22s
default-n4xc5   c5a.2xlarge   on-demand   ap-northeast-2a   ip-192-168-149-58.ap-northeast-2.compute.internal   Unknown   30s
default-n4xc5   c5a.2xlarge   on-demand   ap-northeast-2a   ip-192-168-149-58.ap-northeast-2.compute.internal   True      31s
default-n4xc5   c5a.2xlarge   on-demand   ap-northeast-2a   ip-192-168-149-58.ap-northeast-2.compute.internal   True      36s


# Inspect the NodeClaim
kubectl describe nodeclaims
Name:         default-n4xc5
Namespace:
Labels:       karpenter.k8s.aws/ec2nodeclass=default
              karpenter.k8s.aws/instance-category=c
              karpenter.k8s.aws/instance-cpu=8
              karpenter.k8s.aws/instance-cpu-manufacturer=amd
              karpenter.k8s.aws/instance-cpu-sustained-clock-speed-mhz=3300
              karpenter.k8s.aws/instance-ebs-bandwidth=3170
              karpenter.k8s.aws/instance-encryption-in-transit-supported=true
              karpenter.k8s.aws/instance-family=c5a
              karpenter.k8s.aws/instance-generation=5
              karpenter.k8s.aws/instance-hypervisor=nitro
              karpenter.k8s.aws/instance-memory=16384
              karpenter.k8s.aws/instance-network-bandwidth=2500
              karpenter.k8s.aws/instance-size=2xlarge
              karpenter.sh/capacity-type=on-demand
              karpenter.sh/nodepool=default
              kubernetes.io/arch=amd64
              kubernetes.io/os=linux
              node.kubernetes.io/instance-type=c5a.2xlarge
              topology.k8s.aws/zone-id=apne2-az1
              topology.kubernetes.io/region=ap-northeast-2
              topology.kubernetes.io/zone=ap-northeast-2a
Annotations:  compatibility.karpenter.k8s.aws/cluster-name-tagged: true
              karpenter.k8s.aws/ec2nodeclass-hash: 15535182697325354914
              karpenter.k8s.aws/ec2nodeclass-hash-version: v4
              karpenter.k8s.aws/tagged: true
              karpenter.sh/nodepool-hash: 6821555240594823858
              karpenter.sh/nodepool-hash-version: v3
API Version:  karpenter.sh/v1
Kind:         NodeClaim
Metadata:
  Creation Timestamp:  2025-03-06T15:57:00Z
  Finalizers:
    karpenter.sh/termination
  Generate Name:  default-
  Generation:     1
  Owner References:
    API Version:           karpenter.sh/v1
    Block Owner Deletion:  true
    Kind:                  NodePool
    Name:                  default
    UID:                   9342267c-6f75-488c-b067-9005999e31ef
  Resource Version:        5525
  UID:                     3bd4c5ab-c393-4b28-bb46-96531f0d1fc8
Spec:
  Expire After:  720h
  Node Class Ref:
    Group:  karpenter.k8s.aws
    Kind:   EC2NodeClass
    Name:   default
  Requirements:
    Key:       karpenter.sh/nodepool
    Operator:  In
    Values:
      default
    Key:       node.kubernetes.io/instance-type
    Operator:  In
    Values:
      c4.2xlarge
      c4.4xlarge
      c5.2xlarge
      c5.4xlarge
      c5a.2xlarge
      c5a.4xlarge
      c5a.8xlarge
      c5d.2xlarge
      c5d.4xlarge
      c5n.2xlarge
      c5n.4xlarge
      c6i.2xlarge
      c6i.4xlarge
      c6id.2xlarge
      c6id.4xlarge
      c6in.2xlarge
      c6in.4xlarge
      c7i-flex.2xlarge
      c7i-flex.4xlarge
      c7i.2xlarge
      c7i.4xlarge
      m4.2xlarge
      m4.4xlarge
      m5.2xlarge
      m5.4xlarge
      m5a.2xlarge
      m5a.4xlarge
      m5ad.2xlarge
      m5ad.4xlarge
      m5d.2xlarge
      m5d.4xlarge
      m5zn.2xlarge
      m5zn.3xlarge
      m6i.2xlarge
      m6i.4xlarge
      m6id.2xlarge
      m6id.4xlarge
      m7i-flex.2xlarge
      m7i-flex.4xlarge
      m7i.2xlarge
      m7i.4xlarge
      r3.2xlarge
      r4.2xlarge
      r4.4xlarge
      r5.2xlarge
      r5.4xlarge
      r5a.2xlarge
      r5a.4xlarge
      r5ad.2xlarge
      r5ad.4xlarge
      r5b.2xlarge
      r5d.2xlarge
      r5d.4xlarge
      r5dn.2xlarge
      r5n.2xlarge
      r6i.2xlarge
      r6i.4xlarge
      r6id.2xlarge
      r7i.2xlarge
      r7i.4xlarge
    Key:       kubernetes.io/os
    Operator:  In
    Values:
      linux
    Key:       karpenter.sh/capacity-type
    Operator:  In
    Values:
      on-demand
    Key:       karpenter.k8s.aws/instance-category
    Operator:  In
    Values:
      c
      m
      r
    Key:       karpenter.k8s.aws/instance-generation
    Operator:  Gt
    Values:
      2
    Key:       kubernetes.io/arch
    Operator:  In
    Values:
      amd64
    Key:       karpenter.k8s.aws/ec2nodeclass
    Operator:  In
    Values:
      default
  Resources:
    Requests:
      Cpu:   5150m
      Pods:  8
Status:
  Allocatable:
    Cpu:                        7910m
    Ephemeral - Storage:        17Gi
    Memory:                     14162Mi
    Pods:                       58
    vpc.amazonaws.com/pod-eni:  38
  Capacity:
    Cpu:                        8
    Ephemeral - Storage:        20Gi
    Memory:                     15155Mi
    Pods:                       58
    vpc.amazonaws.com/pod-eni:  38
  Conditions:
    Last Transition Time:  2025-03-06T15:57:02Z
    Message:
    Observed Generation:   1
    Reason:                Launched
    Status:                True
    Type:                  Launched
    Last Transition Time:  2025-03-06T15:57:21Z
    Message:
    Observed Generation:   1
    Reason:                Registered
    Status:                True
    Type:                  Registered
    Last Transition Time:  2025-03-06T15:57:31Z
    Message:
    Observed Generation:   1
    Reason:                Initialized
    Status:                True
    Type:                  Initialized
    Last Transition Time:  2025-03-06T15:58:36Z
    Message:
    Observed Generation:   1
    Reason:                Consolidatable
    Status:                True
    Type:                  Consolidatable
    Last Transition Time:  2025-03-06T15:57:31Z
    Message:
    Observed Generation:   1
    Reason:                Ready
    Status:                True
    Type:                  Ready
  Image ID:                ami-089f1bf55c5291efd
  Last Pod Event Time:     2025-03-06T15:57:36Z
  Node Name:               ip-192-168-149-58.ap-northeast-2.compute.internal
  Provider ID:             aws:///ap-northeast-2a/i-0068e4889e1e71961
Events:
  Type    Reason             Age    From       Message
  ----    ------             ----   ----       -------
  Normal  Launched           4m35s  karpenter  Status condition transitioned, Type: Launched, Status: Unknown -> True, Reason: Launched
  Normal  DisruptionBlocked  4m31s  karpenter  Nodeclaim does not have an associated node
  Normal  Registered         4m16s  karpenter  Status condition transitioned, Type: Registered, Status: Unknown -> True, Reason: Registered
  Normal  Initialized        4m6s   karpenter  Status condition transitioned, Type: Initialized, Status: Unknown -> True, Reason: Initialized
  Normal  Ready              4m6s   karpenter  Status condition transitioned, Type: Ready, Status: Unknown -> True, Reason: Ready
  Normal  Unconsolidatable   3m     karpenter  Can't replace with a cheaper node

To track node capacity, Karpenter maintains a mapping between the CloudProvider machines and the CustomResources in the cluster. To keep this mapping consistent, Karpenter uses the following tag keys.

karpenter.sh/managed-by
karpenter.sh/nodepool
kubernetes.io/cluster/${CLUSTER_NAME}

You can see that additional labels have been set on the node registered by Karpenter.

kubectl get node -l karpenter.sh/registered=true -o jsonpath="{.items[0].metadata.labels}" | jq '.'
...
  "karpenter.sh/initialized": "true",
  "karpenter.sh/nodepool": "default",
  "karpenter.sh/registered": "true",
...

The new node, ip-192-168-149-58.ap-northeast-2.compute.internal, was created as a c5a.2xlarge, an instance type different from the existing nodes.

kubectl get no
NAME                                                STATUS   ROLES    AGE   VERSION
ip-192-168-149-58.ap-northeast-2.compute.internal   Ready    <none>   11m   v1.32.1-eks-5d632ec
ip-192-168-33-227.ap-northeast-2.compute.internal   Ready    <none>   29m   v1.32.1-eks-5d632ec
ip-192-168-91-227.ap-northeast-2.compute.internal   Ready    <none>   29m   v1.32.1-eks-5d632ec

This was also confirmed in the web console.

Karpenter created a single node able to accommodate all the pods waiting to be scheduled, and its provisioning was noticeably faster than with CA.
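
The provisioning speed can be put into numbers using the timestamps from the `kubectl describe nodeclaims` output above. The following is a minimal sketch, assuming GNU `date` (the `-d` flag; BSD/macOS `date` needs different options). The helper name `elapsed_seconds` is made up for this illustration; the timestamps are copied from the NodeClaim's Creation Timestamp and Conditions.

```shell
#!/usr/bin/env bash
# Sketch only: measures how long each NodeClaim condition took,
# relative to the NodeClaim creation time. Assumes GNU date.

# elapsed_seconds START END -> whole seconds between two RFC 3339 timestamps
elapsed_seconds() {
  local start end
  start=$(date -u -d "$1" +%s)
  end=$(date -u -d "$2" +%s)
  echo $(( end - start ))
}

created="2025-03-06T15:57:00Z"   # Metadata > Creation Timestamp

echo "Launched:    $(elapsed_seconds "$created" 2025-03-06T15:57:02Z)s"
echo "Registered:  $(elapsed_seconds "$created" 2025-03-06T15:57:21Z)s"
echo "Initialized: $(elapsed_seconds "$created" 2025-03-06T15:57:31Z)s"
```

The three values line up with the AGE column seen in `kubectl get nodeclaims -w`, where the node name appears at about 21s and READY turns True at 31s.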

Pending pods appear

After the node is created

Let's wrap up the Karpenter lab and clean up the resources.

# Uninstall the Karpenter helm release
helm uninstall karpenter --namespace "${KARPENTER_NAMESPACE}"

# Delete the CloudFormation stack that created the Karpenter IAM Role and related resources
aws cloudformation delete-stack --stack-name "Karpenter-${CLUSTER_NAME}"

# Delete the EC2 Launch Templates
aws ec2 describe-launch-templates --filters "Name=tag:karpenter.k8s.aws/cluster,Values=${CLUSTER_NAME}" |
    jq -r ".LaunchTemplates[].LaunchTemplateName" |
    xargs -I{} aws ec2 delete-launch-template --launch-template-name {}

# Delete the cluster
eksctl delete cluster --name "${CLUSTER_NAME}"

Note that you should scale down the deployment first and delete the cluster only after the nodes created by Karpenter have been removed.

When I deleted the cluster right away, the nodes created by Karpenter were not removed and remained as EC2 instances. Because these nodes were created by Karpenter, they do not seem to be cleaned up as resources that EKS manages directly.

In addition, if the Karpenter IAM Role created through CloudFormation is still not deleted after the cluster is removed, delete it manually in the AWS CloudFormation console.
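
The cleanup order described above (scale down, wait for the Karpenter nodes to go away, then delete the cluster) could be sketched as follows. This is an illustration under assumptions, not the workshop's procedure: the function name `wait_for_nodeclaims_gone`, the 5-second polling interval, and the `DEPLOYMENT_NAME` placeholder are all made up here.

```shell
#!/usr/bin/env bash
# Sketch only: block until Karpenter has removed all NodeClaims.

# wait_for_nodeclaims_gone [TIMEOUT_SECONDS]
# Polls `kubectl get nodeclaims` until it returns nothing or the timeout expires.
wait_for_nodeclaims_gone() {
  local timeout=${1:-600} waited=0
  while [ -n "$(kubectl get nodeclaims -o name 2>/dev/null)" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "NodeClaims still present after ${timeout}s" >&2
      return 1
    fi
    sleep 5
    waited=$((waited + 5))
  done
  echo "No NodeClaims remain"
}

# Intended order (left commented out; DEPLOYMENT_NAME is a placeholder):
# kubectl scale deployment "${DEPLOYMENT_NAME}" --replicas=0
# wait_for_nodeclaims_gone 600 && eksctl delete cluster --name "${CLUSTER_NAME}"
```

Since kubectl errors are silenced, an unreachable cluster also looks like "no NodeClaims", so it is worth checking the EC2 console once before deleting the cluster.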

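3. Autoscaling in AKS

AKS also provides solutions that correspond to the EKS autoscaling options.

As we saw earlier, the EKS autoscaling options are delivered in a form where users install the components themselves.

AKS provides autoscaling options as add-ons or features, so you can use them by specifying the relevant option when creating the cluster (or by enabling the feature on an existing cluster).

As noted earlier, HPA is built into Kubernetes and is therefore available in any Kubernetes environment. The remaining autoscaling features offered by AKS are as follows.

KEDA: provided as an add-on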

https://learn.microsoft.com/en-us/azure/aks/keda-about

It can be enabled with the --enable-keda cluster option.

az aks create --resource-group myResourceGroup --name myAKSCluster --enable-keda --generate-ssh-keys

VPA: `--enable-vpa` option

https://learn.microsoft.com/en-us/azure/aks/use-vertical-pod-autoscaler

The VPA feature can be enabled with the --enable-vpa cluster option.

az aks create --resource-group myResourceGroup --name myAKSCluster --enable-vpa --generate-ssh-keys

Cluster Autoscaler: `--enable-cluster-autoscaler` option

https://learn.microsoft.com/en-us/azure/aks/cluster-autoscaler?tabs=azure-cli

It can be enabled with the --enable-cluster-autoscaler cluster option, and the minimum and maximum node counts can be set with --min-count and --max-count.

az aks create --resource-group myResourceGroup --name myAKSCluster --node-count 1 --vm-set-type VirtualMachineScaleSets --load-balancer-sku standard --enable-cluster-autoscaler --min-count 1 --max-count 3 --generate-ssh-keys

In addition, CA options such as scan interval and expander can be defined through the cluster autoscaler profile.

See the document below for the supported options.

https://learn.microsoft.com/en-us/azure/aks/cluster-autoscaler?tabs=azure-cli#use-the-cluster-autoscaler-profile

Karpenter: enabled as NAP (Node Autoprovisioning)

https://learn.microsoft.com/en-us/azure/aks/node-autoprovision?tabs=azure-cli

Node Autoprovisioning can be enabled with the --node-provisioning-mode Auto cluster option. As of March 2025, NAP is in Preview.

az aks create --name $CLUSTER_NAME --resource-group $RESOURCE_GROUP_NAME --node-provisioning-mode Auto --network-plugin azure --network-plugin-mode overlay --network-dataplane cilium --generate-ssh-keys

4. Autoscaling Caveats

When using these autoscaling options on a CSP, keep the following general caveats in mind.

Using VPA and HPA together for pod scaling on the same resource metric (CPU or memory) is not recommended.
Pods newly created by VPA can request more resources than are available and end up Pending. For this reason VPA may need to be used together with CA. (Alternatively, you can run VPA in Off mode and use its recommendations only for right-sizing.)
Do not use CA and Karpenter at the same time.
Do not configure CA together with a virtual machine scaling mechanism (for example, VM scaling based on CPU usage). This can produce unintended results.
Scale down in the node scaling options can cause unintended pod evictions. Where necessary, either set an annotation on the pod to prevent eviction (cluster-autoscaler.kubernetes.io/safe-to-evict="false") or use a PDB (Pod Disruption Budget) to keep evictions safe.
Frequent scale up/down can actually undermine application stability, so use monitoring to stabilize resource usage, or consider options that add a stabilization delay to scale up/down.
If node scaling generates frequent API requests, API throttling can occur, and requests may then fail to be processed normally.
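
The eviction-related caveat above can be made concrete with a pair of manifests. The following is a minimal sketch that only writes a YAML file to disk. The workload label `app: my-app`, the object names, and the placeholder image are assumptions for illustration; only the annotation key comes from the text above.

```shell
#!/usr/bin/env bash
# Sketch only: write example manifests to a file; nothing is applied to a cluster.
cat <<'EOF' > eviction-guards.yaml
# Hypothetical workload label "app: my-app"; adjust the selector to your own pods.
# The PDB keeps at least one matching pod running during voluntary evictions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-app
---
# A pod that Cluster Autoscaler should not evict when scaling down.
apiVersion: v1
kind: Pod
metadata:
  name: my-app-critical
  labels:
    app: my-app
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9   # placeholder image
EOF

# To try it on a cluster (not run here):
# kubectl apply -f eviction-guards.yaml
```

In practice the annotation usually goes in the pod template of a Deployment rather than on a bare Pod; a standalone Pod is used here only to keep the example short.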

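Wrap-up

In this post, we looked at the autoscaling options in EKS and compared them with AKS.

EKS can be characterized by a minimal default configuration that, in return, gives users autonomy. The autoscaling components must also be set up by the user, and if the user installs them incorrectly or misunderstands how they work, autoscaling may not function properly. Upgrading those components is the user's responsibility as well.

On the other hand, because the components run in the user's data plane, you can understand how they behave and troubleshoot issues directly. And because the open-source components are used as-is, their full range of options is available. In this respect the EKS environment is lightweight, but since the customer configures a large part of it, it may be better suited to advanced users.

AKS provides autoscaling options as part of the managed service. It offers options to enable VPA, CAS, and KEDA on the cluster, and has recently started offering Karpenter in Preview under the name NAP (Node Auto Provisioning).

As a result, ordinary users who understand the concepts can easily use these features as add-ons, and because AKS manages the lifecycle of add-on components directly, they are convenient to operate.

However, this configuration exposes only validated options, so not every open-source option may be available, and you should check the limitations. Because the components are placed in the control plane, direct troubleshooting can also be limited. And since the features are delivered as a managed service, they may lag behind the open-source projects when new options are added.

One more point worth mentioning: many of the components that support EKS functionally can be customized (options changed, components removed, and so on). By contrast, AKS system components are managed by the addon manager; if you look at these components or their ConfigMaps, they carry the label addonmanager.kubernetes.io/mode=Reconcile. Because these resources are periodically reconciled by the addon manager, any arbitrary change a user makes is reverted. In other words, AKS system components can be controlled only in the permitted ways. In general, modifying the managed area is not recommended.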

[Note]
For more on the addon manager, see the document below.
https://github.com/kubernetes/kubernetes/blob/master/cluster/addons/addon-manager/README.md#addon-manager

That concludes this post.

In the next post, I will write up what I learned about EKS security.


[5-1] EKS Autoscaling Part 1

한명 2025. 3. 7. 00:35


In this post, we look at the autoscaling options in EKS.

We first survey the scaling options available in a basic Kubernetes environment. We then look at the EKS autoscaling options and verify them through hands-on labs. Finally, we compare the AKS autoscaling options with those of EKS.

The post grew too long while writing, so Part 1 covers HPA, KEDA, and VPA, and Part 2 continues from Cluster Autoscaler.

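1. Scaling in Kubernetes

Kubernetes Patterns (책만, 2020) describes the levels of application scaling in a Kubernetes environment as follows.

Source: Bilgin Ibryam and Roland Huß, trans. 안승규 and 서한배, 책만, 2020, p. 272.

First comes application tuning. Even in a Kubernetes environment, the application itself must be built to make the most of the resources it is given. Increasing the resources allocated to the application or adding replicas is secondary.

Second is the Vertical Pod Autoscaler (VPA). When a pod runs short of resources, VPA increases the configured resources (CPU/memory) themselves. In Kubernetes you can specify requests and limits, and these are the values that change.

Third is the Horizontal Pod Autoscaler (HPA). When a pod's resource usage exceeds a threshold, HPA increases the number of replicas of that pod.

Fourth, once VPA or HPA has increased the size or number of application pods, the nodes' allocatable resources can be exhausted and pods can become unschedulable. At that point the nodes themselves must be increased, which is what the Cluster Autoscaler (CA) supports.

Kubernetes offers the techniques above for scaling applications; next we look at how they can be used in EKS.

Application tuning is generally a separate topic not specific to Kubernetes, so it is left out of this discussion.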

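2. EKS Autoscaling Overview

EKS supports all of HPA, VPA, and CA described above, and additionally offers a node autoscaling approach called Karpenter. Although it is not shown in the figure, KEDA can also be used to extend HPA with event-driven scaling.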

Source: https://www.youtube.com/watch?v=jLuVZX6WQsw

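HPA is built into Kubernetes, so it can be used simply by creating an hpa object, with no separate installation.

In EKS, KEDA, VPA, CA, and Karpenter must be installed and configured by the user, by deploying helm charts or YAML manifests. Because they are deployed as open source rather than as add-ons, you manage their lifecycle yourself and can use every option they provide.

Because the controller components are deployed directly in the data plane, you can observe how they operate. Keep in mind, though, that they consume data plane resources and that managing them is the user's responsibility.

For reference, KEDA, VPA, CA, and Karpenter in AKS are provided as add-ons or features, so they can be regarded as part of the managed area.

3. Creating the Lab Environment

The lab environment is as follows.

We build the lab environment with CloudFormation.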

# Download the YAML file
curl -O https://s3.ap-northeast-2.amazonaws.com/cloudformation.cloudneta.net/K8S/myeks-5week.yaml

# Set variables
CLUSTER_NAME=myeks
SSHKEYNAME=<SSH key pair name>
MYACCESSKEY=<IAM User access key>
MYSECRETKEY=<IAM User secret key>

# Deploy the CloudFormation stack
aws cloudformation deploy --template-file myeks-5week.yaml --stack-name $CLUSTER_NAME --parameter-overrides KeyName=$SSHKEYNAME SgIngressSshCidr=$(curl -s ipinfo.io/ip)/32  MyIamUserAccessKeyID=$MYACCESSKEY MyIamUserSecretAccessKey=$MYSECRETKEY ClusterBaseName=$CLUSTER_NAME --region ap-northeast-2

# After the CloudFormation stack is deployed, print the IP of the working EC2 instance
aws cloudformation describe-stacks --stack-name myeks --query 'Stacks[*].Outputs[0].OutputValue' --output text

# Connect to the EC2 instance
ssh -i ~/.ssh/ekskey.pem ec2-user@$(aws cloudformation describe-stacks --stack-name myeks --query 'Stacks[*].Outputs[0].OutputValue' --output text)

CloudFormation deploys the operations server and then goes on to deploy EKS. This takes roughly 15 to 20 minutes.

Next, check the created EKS cluster and fetch the kubeconfig.

# Set variables
CLUSTER_NAME=myeks

# Check the cluster
eksctl get cluster

# Generate the kubeconfig
aws sts get-caller-identity --query Arn
aws eks update-kubeconfig --name $CLUSTER_NAME --user-alias <user from the identity printed above>

# Basic cluster checks
kubectl get node --label-columns=node.kubernetes.io/instance-type,eks.amazonaws.com/capacityType,topology.kubernetes.io/zone
kubectl get pod -A
NAME                                               STATUS   ROLES    AGE     VERSION               INSTANCE-TYPE   CAPACITYTYPE   ZONE
ip-192-168-1-87.ap-northeast-2.compute.internal    Ready    <none>   3m12s   v1.31.5-eks-5d632ec   t3.medium       ON_DEMAND      ap-northeast-2a
ip-192-168-2-195.ap-northeast-2.compute.internal   Ready    <none>   3m9s    v1.31.5-eks-5d632ec   t3.medium       ON_DEMAND      ap-northeast-2b
ip-192-168-3-136.ap-northeast-2.compute.internal   Ready    <none>   3m8s    v1.31.5-eks-5d632ec   t3.medium       ON_DEMAND      ap-northeast-2c

kubectl get pod -A
NAMESPACE     NAME                                  READY   STATUS    RESTARTS   AGE
kube-system   aws-node-b7vsh                        2/2     Running   0          3m16s
kube-system   aws-node-fnrj8                        2/2     Running   0          3m13s
kube-system   aws-node-ltkxn                        2/2     Running   0          3m12s
kube-system   coredns-86f5954566-96j8r              1/1     Running   0          9m10s
kube-system   coredns-86f5954566-j6dvp              1/1     Running   0          9m10s
kube-system   ebs-csi-controller-549bf6879f-6h7jg   6/6     Running   0          49s
kube-system   ebs-csi-controller-549bf6879f-p2gml   6/6     Running   0          49s
kube-system   ebs-csi-node-bm94h                    3/3     Running   0          49s
kube-system   ebs-csi-node-h5ntq                    3/3     Running   0          49s
kube-system   ebs-csi-node-n7s2s                    3/3     Running   0          49s
kube-system   kube-proxy-brf4v                      1/1     Running   0          3m16s
kube-system   kube-proxy-x7sbw                      1/1     Running   0          3m13s
kube-system   kube-proxy-zm6ht                      1/1     Running   0          3m12s
kube-system   metrics-server-6bf5998d9c-bzxn6       1/1     Running   0          9m9s
kube-system   metrics-server-6bf5998d9c-zs5mg       1/1     Running   0          9m10s

Then install a few components that are used in the labs that follow.

# Environment variables
CERT_ARN=$(aws acm list-certificates --query 'CertificateSummaryList[].CertificateArn[]' --output text)
MyDomain=aperson.link # enter your own domain name
MyDnzHostedZoneId=$(aws route53 list-hosted-zones-by-name --dns-name "$MyDomain." --query "HostedZones[0].Id" --output text)

# AWS LoadBalancerController
helm repo add eks https://aws.github.io/eks-charts
helm install aws-load-balancer-controller eks/aws-load-balancer-controller -n kube-system --set clusterName=$CLUSTER_NAME \
  --set serviceAccount.create=false --set serviceAccount.name=aws-load-balancer-controller

# ExternalDNS
echo $MyDomain
curl -s https://raw.githubusercontent.com/gasida/PKOS/main/aews/externaldns.yaml | MyDomain=$MyDomain MyDnzHostedZoneId=$MyDnzHostedZoneId envsubst | kubectl apply -f -

# kube-ops-view
helm repo add geek-cookbook https://geek-cookbook.github.io/charts/
helm install kube-ops-view geek-cookbook/kube-ops-view --version 1.2.2 --set service.main.type=ClusterIP  --set env.TZ="Asia/Seoul" --namespace kube-system

# Ingress for kube-ops-view: the group setting lets multiple Ingresses share a single ALB
echo $CERT_ARN
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    alb.ingress.kubernetes.io/certificate-arn: $CERT_ARN
    alb.ingress.kubernetes.io/group.name: study
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}, {"HTTP":80}]'
    alb.ingress.kubernetes.io/load-balancer-name: $CLUSTER_NAME-ingress-alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/ssl-redirect: "443"
    alb.ingress.kubernetes.io/success-codes: 200-399
    alb.ingress.kubernetes.io/target-type: ip
  labels:
    app.kubernetes.io/name: kubeopsview
  name: kubeopsview
  namespace: kube-system
spec:
  ingressClassName: alb
  rules:
  - host: kubeopsview.$MyDomain
    http:
      paths:
      - backend:
          service:
            name: kube-ops-view
            port:
              number: 8080  # name: http
        path: /
        pathType: Prefix
EOF

Install Prometheus and Grafana as well.

# Add the repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

# Create the values file: changed to not use PV/PVC (AWS EBS), since they are inconvenient to delete
cat <<EOT > monitor-values.yaml
prometheus:
  prometheusSpec:
    scrapeInterval: "15s"
    evaluationInterval: "15s"
    podMonitorSelectorNilUsesHelmValues: false
    serviceMonitorSelectorNilUsesHelmValues: false
    retention: 5d
    retentionSize: "10GiB"

  # Enable vertical pod autoscaler support for prometheus-operator
  verticalPodAutoscaler:
    enabled: true

  ingress:
    enabled: true
    ingressClassName: alb
    hosts: 
      - prometheus.$MyDomain
    paths: 
      - /*
    annotations:
      alb.ingress.kubernetes.io/scheme: internet-facing
      alb.ingress.kubernetes.io/target-type: ip
      alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}, {"HTTP":80}]'
      alb.ingress.kubernetes.io/certificate-arn: $CERT_ARN
      alb.ingress.kubernetes.io/success-codes: 200-399
      alb.ingress.kubernetes.io/load-balancer-name: myeks-ingress-alb
      alb.ingress.kubernetes.io/group.name: study
      alb.ingress.kubernetes.io/ssl-redirect: '443'

grafana:
  defaultDashboardsTimezone: Asia/Seoul
  adminPassword: xxx # Grafana password
  defaultDashboardsEnabled: false

  ingress:
    enabled: true
    ingressClassName: alb
    hosts: 
      - grafana.$MyDomain
    paths: 
      - /*
    annotations:
      alb.ingress.kubernetes.io/scheme: internet-facing
      alb.ingress.kubernetes.io/target-type: ip
      alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}, {"HTTP":80}]'
      alb.ingress.kubernetes.io/certificate-arn: $CERT_ARN
      alb.ingress.kubernetes.io/success-codes: 200-399
      alb.ingress.kubernetes.io/load-balancer-name: myeks-ingress-alb
      alb.ingress.kubernetes.io/group.name: study
      alb.ingress.kubernetes.io/ssl-redirect: '443'

kube-state-metrics:
  rbac:
    extraRules:
      - apiGroups: ["autoscaling.k8s.io"]
        resources: ["verticalpodautoscalers"]
        verbs: ["list", "watch"]
  customResourceState:
    enabled: true
    config:
      kind: CustomResourceStateMetrics
      spec:
        resources:
          - groupVersionKind:
              group: autoscaling.k8s.io
              kind: "VerticalPodAutoscaler"
              version: "v1"
            labelsFromPath:
              verticalpodautoscaler: [metadata, name]
              namespace: [metadata, namespace]
              target_api_version: [apiVersion]
              target_kind: [spec, targetRef, kind]
              target_name: [spec, targetRef, name]
            metrics:
              - name: "vpa_containerrecommendations_target"
                help: "VPA container recommendations for memory."
                each:
                  type: Gauge
                  gauge:
                    path: [status, recommendation, containerRecommendations]
                    valueFrom: [target, memory]
                    labelsFromPath:
                      container: [containerName]
                commonLabels:
                  resource: "memory"
                  unit: "byte"
              - name: "vpa_containerrecommendations_target"
                help: "VPA container recommendations for cpu."
                each:
                  type: Gauge
                  gauge:
                    path: [status, recommendation, containerRecommendations]
                    valueFrom: [target, cpu]
                    labelsFromPath:
                      container: [containerName]
                commonLabels:
                  resource: "cpu"
                  unit: "core"
  selfMonitor:
    enabled: true

alertmanager:
  enabled: false
defaultRules:
  create: false
kubeControllerManager:
  enabled: false
kubeEtcd:
  enabled: false
kubeScheduler:
  enabled: false
prometheus-windows-exporter:
  prometheus:
    monitor:
      enabled: false
EOT
cat monitor-values.yaml

# Deploy with helm
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack --version 69.3.1 \
-f monitor-values.yaml --create-namespace --namespace monitoring

# Check the helm values
helm get values -n monitoring kube-prometheus-stack

# No PVs are used
kubectl get pv,pvc -A

# Prometheus web access
echo -e "https://prometheus.$MyDomain"

# Grafana web access
echo -e "https://grafana.$MyDomain"

# Check the TargetGroup bindings
kubectl get targetgroupbindings.elbv2.k8s.aws -A
NAMESPACE     NAME                               SERVICE-NAME                       SERVICE-PORT   TARGET-TYPE   AGE
kube-system   k8s-kubesyst-kubeopsv-b2ecfd420f   kube-ops-view                      8080           ip            2m54s
monitoring    k8s-monitori-kubeprom-40399c957e   kube-prometheus-stack-grafana      80             ip            45s
monitoring    k8s-monitori-kubeprom-826f25cbb8   kube-prometheus-stack-prometheus   9090           ip            45s

4. HPA(Horizontal Pod Autoscaler)

HPA increases the number of replicas when a specific metric of the target workload exceeds its threshold.

Source: https://www.youtube.com/watch?v=jLuVZX6WQsw

HPA is built into Kubernetes, so it is available in any Kubernetes environment. The one thing to remember is that HPA bases its decisions on the core system metrics provided by metrics-server, so if metrics-server is unhealthy, HPA does not work.

Since HPA is not an EKS-specific technology, we do not run a dedicated lab for it on EKS; instead, the next section looks at it through a case that uses HPA via KEDA.

HPA Lab

For a hands-on HPA lab, refer to the official Kubernetes documentation.

https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/

1) Sample application

apiVersion: apps/v1
kind: Deployment
metadata:
  name: php-apache
spec:
  selector:
    matchLabels:
      run: php-apache
  template:
    metadata:
      labels:
        run: php-apache
    spec:
      containers:
      - name: php-apache
        image: registry.k8s.io/hpa-example
        ports:
        - containerPort: 80
        resources:
          limits:
            cpu: 500m
          requests:
            cpu: 200m
---
apiVersion: v1
kind: Service
metadata:
  name: php-apache
  labels:
    run: php-apache
spec:
  ports:
  - port: 80
  selector:
    run: php-apache

2) Define the HPA

kubectl autoscale deployment php-apache --cpu-percent=50 --min=1 --max=10
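
For reference, the imperative command above corresponds roughly to the declarative manifest below. This is a sketch that only writes the file; the file name `php-apache-hpa.yaml` is an arbitrary choice for this example, and the values mirror the `--cpu-percent=50 --min=1 --max=10` flags.

```shell
#!/usr/bin/env bash
# Sketch only: write an autoscaling/v2 HPA manifest equivalent to
# `kubectl autoscale deployment php-apache --cpu-percent=50 --min=1 --max=10`.
cat <<'EOF' > php-apache-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50   # percent of the pod's CPU request
EOF

# To apply it instead of running kubectl autoscale (not run here):
# kubectl apply -f php-apache-hpa.yaml
```

Keeping the HPA as a manifest makes it easier to version it alongside the Deployment.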

3) Generate load

# Run this in a separate terminal
# so that the load generation continues and you can carry on with the rest of the steps
kubectl run -i --tty load-generator --rm --image=busybox:1.28 --restart=Never -- /bin/sh -c "while sleep 0.01; do wget -q -O- http://php-apache; done"

You can then watch the metric and the replica count increase with kubectl get hpa php-apache --watch.
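
To make the replica numbers in the watch output easier to read, here is a simplified sketch of the replica calculation described in the Kubernetes HPA documentation: desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). The function name `hpa_desired_replicas` is made up for this example, and the sketch deliberately ignores the tolerance band, stabilization windows, and pod readiness handling that the real controller applies.

```shell
#!/usr/bin/env bash
# Sketch only: integer version of the HPA replica formula,
# clamped to the min/max set on the HPA (defaults match --min=1 --max=10).

# hpa_desired_replicas CURRENT_REPLICAS CURRENT_UTIL TARGET_UTIL [MIN] [MAX]
hpa_desired_replicas() {
  local current=$1 util=$2 target=$3 min=${4:-1} max=${5:-10}
  # ceil(current * util / target) using integer arithmetic
  local desired=$(( (current * util + target - 1) / target ))
  if (( desired < min )); then desired=$min; fi
  if (( desired > max )); then desired=$max; fi
  echo "$desired"
}

# 1 replica at 305% CPU utilization against a 50% target
hpa_desired_replicas 1 305 50
# 4 replicas at 20% utilization against a 50% target (scale in)
hpa_desired_replicas 4 20 50
```

Utilization here is relative to the CPU request (200m in the sample Deployment), which is why values well above 100% can appear.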

Extending HPA Metrics

HPA collects values through the Metrics API, which Kubernetes does not provide out of the box; metrics-server is required to expose it. HPA makes its scaling decisions based on predefined resource metrics, and metrics-server provides these by default.

To extend the Metrics API, you can use custom metrics and external metrics. Typically Prometheus collects the additional metrics, and Prometheus Adapter acts as the Custom Metrics API server, exposing custom and external metrics based on what Prometheus has collected.

Source: https://itnext.io/autoscaling-apps-on-kubernetes-with-the-horizontal-pod-autoscaler-798750ab7847

KEDA, which we look at next, also runs its own Metrics API server and exposes external metrics.

5. KEDA(Kubernetes Event-driven Autoscaler)

HPA decides on scaling based on metrics such as CPU and memory collected by metrics-server. KEDA (Kubernetes Event-driven Autoscaler) is the component that lets HPA act on metrics other than these resource-based ones.

KEDA can decide whether to scale based on events coming from a variety of event sources.

Source: https://www.youtube.com/watch?v=jLuVZX6WQsw

In EKS, KEDA can be installed with helm. (This lab includes monitoring through Prometheus, so Prometheus must be installed beforehand.)

# Before installation, check the Metrics API provided by the existing metrics-server
kubectl get --raw "/apis/metrics.k8s.io" | jq
{
  "kind": "APIGroup",
  "apiVersion": "v1",
  "name": "metrics.k8s.io",
  "versions": [
    {
      "groupVersion": "metrics.k8s.io/v1beta1",
      "version": "v1beta1"
    }
  ],
  "preferredVersion": {
    "groupVersion": "metrics.k8s.io/v1beta1",
    "version": "v1beta1"
  }
}

# There are no external metrics yet
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq
Error from server (NotFound): the server could not find the requested resource


# Install KEDA
cat <<EOT > keda-values.yaml
metricsServer:
  useHostNetwork: true

prometheus:
  metricServer:
    enabled: true
    port: 9022
    portName: metrics
    path: /metrics
    serviceMonitor:
      # Enables ServiceMonitor creation for the Prometheus Operator
      enabled: true
    podMonitor:
      # Enables PodMonitor creation for the Prometheus Operator
      enabled: true
  operator:
    enabled: true
    port: 8080
    serviceMonitor:
      # Enables ServiceMonitor creation for the Prometheus Operator
      enabled: true
    podMonitor:
      # Enables PodMonitor creation for the Prometheus Operator
      enabled: true
  webhooks:
    enabled: true
    port: 8020
    serviceMonitor:
      # Enables ServiceMonitor creation for the Prometheus webhooks
      enabled: true
EOT

helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --version 2.16.0 --namespace keda --create-namespace -f keda-values.yaml

# The APIService has been created.
kubectl get apiservice v1beta1.external.metrics.k8s.io -o yaml
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  annotations:
    meta.helm.sh/release-name: keda
    meta.helm.sh/release-namespace: keda
  creationTimestamp: "2025-03-06T13:23:54Z"
  labels:
    app.kubernetes.io/component: operator
    app.kubernetes.io/instance: keda
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: v1beta1.external.metrics.k8s.io
    app.kubernetes.io/part-of: keda-operator
    app.kubernetes.io/version: 2.16.0
    helm.sh/chart: keda-2.16.0
  name: v1beta1.external.metrics.k8s.io
  resourceVersion: "7353"
  uid: 26d3a3b1-7487-4086-84f9-1fd3105aa89d
spec:
  caBundle: <omitted>
  group: external.metrics.k8s.io
  groupPriorityMinimum: 100
  service:
    name: keda-operator-metrics-apiserver
    namespace: keda
    port: 443
  version: v1beta1
  versionPriority: 100
status:
  conditions:
  - lastTransitionTime: "2025-03-06T13:24:25Z"
    message: all checks passed
    reason: Passed
    status: "True"
    type: Available

# After installation, check the External Metrics API exposed by the KEDA metrics server.
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "external.metrics.k8s.io/v1beta1",
  "resources": [
    {
      "name": "externalmetrics",
      "singularName": "",
      "namespaced": true,
      "kind": "ExternalMetricValueList",
      "verbs": [
        "get"
      ]
    }
  ]
}

Looking at the pods created after installing KEDA reveals its components, which play the roles of Agent, Metrics, and Admission Webhooks respectively.

$ kubectl get pod -n keda
NAME                                                   READY   STATUS    RESTARTS     AGE
pod/keda-admission-webhooks-86cffccbf5-nq7kw           1/1     Running   0            4m11s
pod/keda-operator-6bdffdc78-zrhmg                      1/1     Running   1 (4m ago)   4m11s
pod/keda-operator-metrics-apiserver-74d844d769-2rbqk   1/1     Running   0            4m11s

The official documentation describes the role of each component.

https://keda.sh/docs/2.10/concepts/#how-keda-works

Agent — KEDA activates and deactivates Kubernetes Deployments to scale to and from zero on no events. This is one of the primary roles of the keda-operator container that runs when you install KEDA.
Metrics — KEDA acts as a Kubernetes metrics server that exposes rich event data like queue length or stream lag to the Horizontal Pod Autoscaler to drive scale out. It is up to the Deployment to consume the events directly from the source. This preserves rich event integration and enables gestures like completing or abandoning queue messages to work out of the box. The metric serving is the primary role of the keda-operator-metrics-apiserver container that runs when you install KEDA.
Admission Webhooks - Automatically validate resource changes to prevent misconfiguration and enforce best practices by using an admission controller. As an example, it will prevent multiple ScaledObjects to target the same scale target.

Listing the CRDs shows that the following were created.

$ kubectl get crd | grep keda
cloudeventsources.eventing.keda.sh           2025-03-06T13:23:51Z
clustercloudeventsources.eventing.keda.sh    2025-03-06T13:23:51Z
clustertriggerauthentications.keda.sh        2025-03-06T13:23:51Z
scaledjobs.keda.sh                           2025-03-06T13:23:53Z
scaledobjects.keda.sh                        2025-03-06T13:23:51Z
triggerauthentications.keda.sh               2025-03-06T13:23:51Z

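Among these, ScaledObject plays the central role. It represents the desired mapping between an event source (for example, RabbitMQ) and a Kubernetes resource (for example, a Deployment).

In the manifest used in the lab below, scaling is triggered by cron (triggers), the target is a Deployment named php-apache (scaleTargetRef), and spec sets the values needed by the HPA object along with other scaling properties.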

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: php-apache-cron-scaled
spec:
  minReplicaCount: 0
  maxReplicaCount: 2
  pollingInterval: 30
  cooldownPeriod: 300
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  triggers:
  - type: cron
    metadata:
      timezone: Asia/Seoul
      start: 00,15,30,45 * * * *
      end: 05,20,35,50 * * * *
      desiredReplicas: "1"

Note that creating a ScaledObject also creates an HPA. We will look at this again below.

Now create the sample application and the ScaledObject.

# Create the Deployment in the keda namespace
cat << EOF > php-apache.yaml
apiVersion: apps/v1
kind: Deployment
metadata: 
  name: php-apache
spec: 
  selector: 
    matchLabels: 
      run: php-apache
  template: 
    metadata: 
      labels: 
        run: php-apache
    spec: 
      containers: 
      - name: php-apache
        image: registry.k8s.io/hpa-example
        ports: 
        - containerPort: 80
        resources: 
          limits: 
            cpu: 500m
          requests: 
            cpu: 200m
---
apiVersion: v1
kind: Service
metadata: 
  name: php-apache
  labels: 
    run: php-apache
spec: 
  ports: 
  - port: 80
  selector: 
    run: php-apache
EOF

kubectl apply -f php-apache.yaml -n keda
kubectl get pod -n keda
...
php-apache-d87b7ff46-bbp8c                         0/1     ContainerCreating   0               3s

# Create the ScaledObject policy: cron
cat <<EOT > keda-cron.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: php-apache-cron-scaled
spec:
  minReplicaCount: 0
  maxReplicaCount: 2  # Specifies the maximum number of replicas to scale up to (defaults to 100).
  pollingInterval: 30  # Specifies how often KEDA should check for scaling events
  cooldownPeriod: 300  # Specifies the cool-down period in seconds after a scaling event
  scaleTargetRef:  # Identifies the Kubernetes deployment or other resource that should be scaled.
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  triggers:  # Defines the specific configuration for your chosen scaler, including any required parameters or settings
  - type: cron
    metadata:
      timezone: Asia/Seoul
      start: 00,15,30,45 * * * *
      end: 05,20,35,50 * * * *
      desiredReplicas: "1"
EOT
kubectl apply -f keda-cron.yaml -n keda

# Monitoring
kubectl get ScaledObject,hpa,pod -n keda
kubectl get ScaledObject -n keda -w

# Check the HPA -> it is created with an External metric type
kubectl get hpa -o jsonpath="{.items[0].spec}" -n keda | jq
{
  "maxReplicas": 2,
  "metrics": [
    {
      "external": {
        "metric": {
          "name": "s0-cron-Asia-Seoul-00,15,30,45xxxx-05,20,35,50xxxx",
          "selector": {
            "matchLabels": {
              "scaledobject.keda.sh/name": "php-apache-cron-scaled"
            }
          }
        },
        "target": {
          "averageValue": "1",
          "type": "AverageValue"
        }
      },
      "type": "External"
    }
  ],
  "minReplicas": 1,
  "scaleTargetRef": {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "name": "php-apache"
  }
}
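
The HPA spec above can be read as simple arithmetic. The following is a minimal Python sketch, not KEDA or Kubernetes code: for an External metric with an AverageValue target, the HPA sizes the workload at roughly ceil(metric total / target average) and then clamps the result between minReplicas and maxReplicas. The metric totals passed in are hypothetical inputs, the function name is made up for illustration, and HPA tolerance and stabilization are ignored.

```python
import math


def desired_replicas(metric_total: float, target_average: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Replica count for an External metric with an AverageValue target."""
    raw = math.ceil(metric_total / target_average)
    # The HPA never goes below minReplicas or above maxReplicas.
    return max(min_replicas, min(max_replicas, raw))


# Values from the HPA spec above: averageValue "1", minReplicas 1, maxReplicas 2.
# Assumption: inside the cron window the metric total equals desiredReplicas (1).
print(desired_replicas(1, 1, 1, 2))
# Hypothetical totals below and above the range, to show the clamping.
print(desired_replicas(0, 1, 1, 2))
print(desired_replicas(5, 1, 1, 2))
```

Because the generated HPA has minReplicas 1, it cannot take the Deployment to zero on its own. The step between 0 and 1 replica is performed by the keda-operator, which is what the scaleexecutor log lines further down show.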

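In this example, the cron trigger scales up to the 1 pod set in desiredReplicas at minutes 00, 15, 30, and 45 of each hour, and then scales back down to the 0 pods set in minReplicaCount at minutes 05, 20, 35, and 50. (maxReplicaCount has little meaning in this example, but it must be specified because the generated HPA object uses it.)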

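You can see the result below. In testing, however, the changes do not fire exactly at the cron times of minute 45 (start) and minute 50 (end), so expectations for timing precision should be lowered somewhat. In the logs below, the scale-up lands about 20 seconds after the start time, which is within the 30-second pollingInterval, and the scale-down to zero lands almost five minutes after the end time, which is consistent with the 300-second cooldownPeriod that KEDA waits after the trigger was last active.

# Before the start time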
Thu Mar  6 22:44:53 KST 2025
NAME                                          SCALETARGETKIND      SCALETARGETNAME   MIN   MAX   READY   ACTIVE   FALLBACK   PAUSED    TRIGGERS   AUTHENTICATIONS   AGE
scaledobject.keda.sh/php-apache-cron-scaled   apps/v1.Deployment   php-apache        0     2     True    False    False      Unknown                                12m

NAME                                                                  REFERENCE               TARGETS             MINPODS   MAXPODS   REPLICAS   AGE
horizontalpodautoscaler.autoscaling/keda-hpa-php-apache-cron-scaled   Deployment/php-apache   <unknown>/1 (avg)   1         2         0          12m

NAME                                                   READY   STATUS    RESTARTS      AGE
pod/keda-admission-webhooks-86cffccbf5-nq7kw           1/1     Running   0             21m
pod/keda-operator-6bdffdc78-zrhmg                      1/1     Running   1 (20m ago)   21m
pod/keda-operator-metrics-apiserver-74d844d769-2rbqk   1/1     Running   0             21m

# After minute 45
Thu Mar  6 22:45:29 KST 2025
NAME                                          SCALETARGETKIND      SCALETARGETNAME   MIN   MAX   READY   ACTIVE   FALLBACK   PAUSED    TRIGGERS   AUTHENTICATIONS   AGE
scaledobject.keda.sh/php-apache-cron-scaled   apps/v1.Deployment   php-apache        0     2     True    True     False      Unknown                                12m

NAME                                                                  REFERENCE               TARGETS             MINPODS   MAXPODS   REPLICAS   AGE
horizontalpodautoscaler.autoscaling/keda-hpa-php-apache-cron-scaled   Deployment/php-apache   <unknown>/1 (avg)   1         2         0          12m

NAME                                                   READY   STATUS    RESTARTS      AGE
pod/keda-admission-webhooks-86cffccbf5-nq7kw           1/1     Running   0             21m
pod/keda-operator-6bdffdc78-zrhmg                      1/1     Running   1 (21m ago)   21m
pod/keda-operator-metrics-apiserver-74d844d769-2rbqk   1/1     Running   0             21m
pod/php-apache-d87b7ff46-gblgb                         1/1     Running   0             9s


# The ScaledObject's ACTIVE status changes to True at minute 45
kubectl get ScaledObject -n keda -w
NAME                     SCALETARGETKIND      SCALETARGETNAME   MIN   MAX   READY   ACTIVE   FALLBACK   PAUSED    TRIGGERS   AUTHENTICATIONS   AGE
php-apache-cron-scaled   apps/v1.Deployment   php-apache        0     2     True    False    False      Unknown                                2m30s
php-apache-cron-scaled   apps/v1.Deployment   php-apache        0     2     True    False    False      Unknown                                7m
php-apache-cron-scaled   apps/v1.Deployment   php-apache        0     2     True    False    False      Unknown                                12m
php-apache-cron-scaled   apps/v1.Deployment   php-apache        0     2     True    True     False      Unknown                                12m
php-apache-cron-scaled   apps/v1.Deployment   php-apache        0     2     True    True     False      Unknown                                13m
...

# Around minute 40 (log timestamps are UTC), the scale target is set to minReplicaCount
kubectl logs -f -n keda keda-operator-6bdffdc78-zrhmg
...
2025-03-06T13:39:52Z    INFO    scaleexecutor   Successfully set ScaleTarget replicas count to ScaledObject minReplicaCount     {"scaledobject.Name": "php-apache-cron-scaled", "scaledObject.Namespace": "keda", "scaleTarget.Name": "php-apache", "Original Replicas Count": 1, "New Replicas Count": 0}

# Around minute 45, keda-operator updates the scale target
kubectl logs -f -n keda keda-operator-6bdffdc78-zrhmg
...
2025-03-06T13:45:22Z    INFO    scaleexecutor   Successfully updated ScaleTarget        {"scaledobject.Name": "php-apache-cron-scaled", "scaledObject.Namespace": "keda", "scaleTarget.Name": "php-apache", "Original Replicas Count": 0, "New Replicas Count": 1}

# At minute 54, the scale target is set to minReplicaCount
2025-03-06T13:54:52Z    INFO    scaleexecutor   Successfully set ScaleTarget replicas count to ScaledObject minReplicaCount     {"scaledobject.Name": "php-apache-cron-scaled", "scaledObject.Namespace": "keda", "scaleTarget.Name": "php-apache", "Original Replicas Count": 1, "New Replicas Count": 0}
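
The timestamps in the operator logs above line up with the pollingInterval and cooldownPeriod values in the ScaledObject. The sketch below is a simplified Python model, not KEDA's implementation. It assumes the operator polls every 30 seconds at seconds :22 and :52 (an offset read off the log timestamps), activates the target on the first poll inside the cron window, and scales to zero once cooldownPeriod has elapsed since the last poll that saw the trigger active. All function names are hypothetical.

```python
def first_poll_at_or_after(t: int, poll_offset: int, interval: int) -> int:
    """Earliest poll time >= t, for polls at poll_offset + k * interval."""
    k = max(0, -(-(t - poll_offset) // interval))  # ceiling division
    return poll_offset + k * interval


def last_poll_before(t: int, poll_offset: int, interval: int) -> int:
    """Latest poll time strictly before t."""
    k = (t - 1 - poll_offset) // interval
    return poll_offset + k * interval


def mmss(seconds: int) -> str:
    """Format seconds past the hour as MM:SS."""
    return f"{seconds // 60:02d}:{seconds % 60:02d}"


POLLING_INTERVAL = 30  # pollingInterval from the ScaledObject
COOLDOWN = 300         # cooldownPeriod from the ScaledObject
POLL_OFFSET = 22       # assumption: polls happen at :22 and :52 of each minute

start, end = 45 * 60, 50 * 60  # cron window 45:00 to 50:00, in seconds past the hour

scale_up = first_poll_at_or_after(start, POLL_OFFSET, POLLING_INTERVAL)
last_active = last_poll_before(end, POLL_OFFSET, POLLING_INTERVAL)
scale_to_zero = last_active + COOLDOWN
print(mmss(scale_up), mmss(scale_to_zero))
```

Under these assumptions the model prints 45:22 54:52, which matches the 13:45:22Z and 13:54:52Z log entries. The same arithmetic applied to the 30 to 35 window gives 39:52 for the scale-down, matching the 13:39:52Z entry.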

You can also monitor KEDA in Grafana by building a dashboard from the JSON below.

https://github.com/kedacore/keda/blob/main/config/grafana/keda-dashboard.json

Finish the lab and delete the created resources.

# Delete KEDA, the Deployment, and related resources
kubectl delete ScaledObject -n keda php-apache-cron-scaled && kubectl delete deploy php-apache -n keda && helm uninstall keda -n keda
kubectl delete namespace keda

6. VPA (Vertical Pod Autoscaler)

VPA is an autoscaler that changes the requests and limits in the target's container spec themselves, based on the target workload's historical usage.

The figure below describes how VPA works. Note that it assumes Update Mode: Auto, so the description runs through to the pod being recreated.

Source: https://www.youtube.com/watch?v=jLuVZX6WQsw

As shown in the figure, VPA has the following main components.

Reference: https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/docs/components.md

Recommender - monitors the current and past resource consumption and, based on it, provides recommended values for the containers' cpu and memory requests.
Updater - checks which of the managed pods have correct resources set and, if not, kills them so that they can be recreated by their controllers with the updated requests.
Admission Controller - sets the correct resource requests on new pods (either just created or recreated by their controller due to Updater's activity).

VPA computes its recommendation by deriving a baseline from historical usage and adding a margin to it. See the link below for details.

https://devocean.sk.com/blog/techBoardDetail.do?ID=164786
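
The "baseline plus margin" idea can be sketched as follows. This is a deliberately simplified Python illustration, not the VPA recommender's algorithm (which works on decaying histograms of usage). The nearest-rank percentile, the 90th percentile, the 15% margin, the floor parameter, and the sample values are all assumptions chosen for the example.

```python
def percentile(samples, percent):
    """Nearest-rank percentile; percent is an integer from 1 to 100."""
    ordered = sorted(samples)
    rank = max(1, -(-percent * len(ordered) // 100))  # ceiling division
    return ordered[rank - 1]


def recommend(samples, percent=90, margin_percent=15, floor=0):
    """Baseline from past usage, plus a safety margin, never below floor."""
    base = percentile(samples, percent)
    return max(floor, base * (100 + margin_percent) / 100)


# Hypothetical usage samples: CPU in millicores, memory in MiB.
cpu_millicores = [350, 360, 370, 380, 385, 390, 395, 398, 400, 520]
memory_mib = [50] * 10

print(recommend(cpu_millicores))         # baseline 400m plus 15%
print(recommend(memory_mib, floor=250))  # the floor wins over 57.5
```

With these inputs the sketch prints 460.0 and 250. The floor models the idea that a recommender may refuse to go below a configured minimum even when observed usage is tiny.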

VPA also lets you set updateMode to one of Auto (default), Recreate, Initial, or Off.

As I understand it, Auto and Recreate currently behave identically: they apply recommendations both to newly created pods and to running pods, restarting running pods when necessary. Once in-place resource resize (alpha in Kubernetes 1.27) becomes available, however, Auto mode could update the resources of a running pod without restarting it.

Initial applies the recommendation only when a pod is created, and Off only makes the recommendation visible through the VPA object.

See the document below for a detailed explanation.

https://learn.microsoft.com/en-us/azure/aks/vertical-pod-autoscaler#vpa-object-operation-modes

Now let's install VPA.

Reference: https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/docs/installation.md#install-command

# Download the VPA code
git clone https://github.com/kubernetes/autoscaler.git


# If openssl is older than 1.1.1, upgrade to 1.1.1 or later
openssl version
OpenSSL 1.0.2k-fips  26 Jan 2017

# (If needed) remove the 1.0 package
yum remove openssl -y

# (If needed) install openssl 1.1.1 or later
yum install openssl11 -y

# Replace openssl with openssl11 in the script (the change must be committed to take effect)
cd autoscaler/vertical-pod-autoscaler/
sed -i 's/openssl/openssl11/g' pkg/admission-controller/gencerts.sh
git status
git config --global user.email "you@example.com"
git config --global user.name "Your Name"
git add .
git commit -m "openssl version modify"

# Deploy the Vertical Pod Autoscaler to your cluster with the following command.
watch -d kubectl get pod -n kube-system
./hack/vpa-up.sh

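# (If needed) if an openssl-related error occurs, apply the replacement again and re-run.
# Normalizing openssl11 back to openssl first keeps a repeated run from producing "openssl1111".
sed -i 's/openssl11/openssl/g; s/openssl/openssl11/g' pkg/admission-controller/gencerts.sh
./hack/vpa-up.sh

# Verify after installation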
kubectl get pod -n kube-system |grep vpa
vpa-admission-controller-659c978dcd-zwn24     1/1     Running   0          106s
vpa-recommender-9bb6d98b7-gjqc7               1/1     Running   0          112s
vpa-updater-68db47986b-jqnnh                  1/1     Running   0          116s

# VPA changes pod settings at creation time through a mutating webhook
kubectl get mutatingwebhookconfigurations vpa-webhook-config

NAME                 WEBHOOKS   AGE
vpa-webhook-config   1          101s

kubectl get mutatingwebhookconfigurations vpa-webhook-config -o json | jq
{
  "apiVersion": "admissionregistration.k8s.io/v1",
  "kind": "MutatingWebhookConfiguration",
  "metadata": {
    "creationTimestamp": "2025-03-06T14:14:31Z",
    "generation": 1,
    "name": "vpa-webhook-config",
    "resourceVersion": "22754",
    "uid": "03b88fcf-c1ff-4079-b33c-38c998829d50"
  },
  "webhooks": [
    {
      "admissionReviewVersions": [
        "v1"
      ],
      "clientConfig": {
        "caBundle": "<omitted>",
        "service": {
          "name": "vpa-webhook",
          "namespace": "kube-system",
          "port": 443
        }
      },
      "failurePolicy": "Ignore",
      "matchPolicy": "Equivalent",
      "name": "vpa.k8s.io",
      "namespaceSelector": {
        "matchExpressions": [
          {
            "key": "kubernetes.io/metadata.name",
            "operator": "NotIn",
            "values": [
              ""
            ]
          }
        ]
      },
      "objectSelector": {},
      "reinvocationPolicy": "Never",
      "rules": [
        {
          "apiGroups": [
            ""
          ],
          "apiVersions": [
            "v1"
          ],
          "operations": [
            "CREATE"
          ],
          "resources": [
            "pods"
          ],
          "scope": "*"
        },
        {
          "apiGroups": [
            "autoscaling.k8s.io"
          ],
          "apiVersions": [
            "*"
          ],
          "operations": [
            "CREATE",
            "UPDATE"
          ],
          "resources": [
            "verticalpodautoscalers"
          ],
          "scope": "*"
        }
      ],
      "sideEffects": "None",
      "timeoutSeconds": 30
    }
  ]
}

Once the installation is complete, you can see the VPA-related pods and CRDs.

kubectl get crd | grep autoscaling
verticalpodautoscalercheckpoints.autoscaling.k8s.io   2025-03-06T14:13:55Z
verticalpodautoscalers.autoscaling.k8s.io             2025-03-06T14:13:55Z

Deploy the sample below to test VPA.

# Deploy the official example (run from autoscaler/vertical-pod-autoscaler)
kubectl apply -f examples/hamster.yaml && kubectl get vpa -w
verticalpodautoscaler.autoscaling.k8s.io/hamster-vpa created
deployment.apps/hamster created
NAME          MODE   CPU   MEM   PROVIDED   AGE
hamster-vpa   Auto                          3s
hamster-vpa   Auto   511m   262144k   True       44s # VPA recommendation

# Check the pod resource requests
kubectl describe pod | grep Requests: -A2
    Requests:
      cpu:        100m
      memory:     50Mi
--
    Requests:
      cpu:        100m
      memory:     50Mi      


# VPA evicts the existing pod and a new pod is created
kubectl get events --sort-by=".metadata.creationTimestamp" | grep VPA
19s         Normal   EvictedByVPA        pod/hamster-598b78f579-8gjfh        Pod was evicted by VPA Updater to apply resource recommendation.
19s         Normal   EvictedPod          verticalpodautoscaler/hamster-vpa   VPA Updater evicted Pod hamster-598b78f579-8gjfh to apply resource recommendation.

# The pod resource requests have changed
kubectl describe pod | grep Requests: -A2
    Requests:
      cpu:        100m
      memory:     50Mi
--
    Requests:
      cpu:        100m
      memory:     50Mi
--
    Requests:
      cpu:        511m
      memory:     262144k
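
The requests above mix unit styles (50Mi before, 262144k after), which makes the change hard to compare at a glance. The following Python sketch converts the values shown above into common units. It handles only the handful of Kubernetes quantity suffixes listed in the code, and the helper names are hypothetical.

```python
DECIMAL = {"k": 10**3, "M": 10**6, "G": 10**9}
BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def memory_bytes(quantity: str) -> int:
    """Convert a memory quantity such as '50Mi' or '262144k' to bytes."""
    for suffix, factor in BINARY.items():
        if quantity.endswith(suffix):
            return int(quantity[:-2]) * factor
    for suffix, factor in DECIMAL.items():
        if quantity.endswith(suffix):
            return int(quantity[:-1]) * factor
    return int(quantity)  # plain bytes, no suffix


def cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as '511m' or '0.5' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


before_mib = memory_bytes("50Mi") / 2**20
after_mib = memory_bytes("262144k") / 2**20
print(before_mib, after_mib)
print(cpu_millicores("100m"), cpu_millicores("511m"))
```

This prints 50.0 250.0 and 100 511: the memory request went from 50 MiB to 250 MiB, and the CPU request from 100m to 511m.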

Finish the lab and delete the resources as follows.

kubectl delete -f examples/hamster.yaml && ./hack/vpa-down.sh

Wrapping up

This post ran longer than expected, so it covers only the autoscaling overview, HPA, KEDA, and VPA. CA, Karpenter, autoscaling in AKS, and scaling caveats continue in the next post.


Various curl options

한명 2025. 3. 6. 21:37


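These are the curl options I use most often when testing services.

timeout

Run without options, curl waits quite a long time before it times out.
A curl test usually only needs to confirm whether the call succeeds, so set a timeout with -m (--max-time, in seconds); once that time passes, the request is treated as timed out.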

$ curl -m 3 192.168.0.1
curl: (28) Connection timed out after 3002 milliseconds

Specifying DNS resolution explicitly

An ingress often serves multiple hosts.
In that case, --resolve lets you explicitly map a host name to an IP that has no DNS record configured yet.

$ curl --resolve test.agic.contoso.com:80:20.20.20.20 http://test.agic.contoso.com
<html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>Microsoft-Azure-Application-Gateway/v2</center>
</body>
</html>

Testing a specific method

Sometimes you need to check, for testing purposes, whether a specific method gets a response.
Use the -X option in that case.

curl -X OPTIONS "https://url.com/default.css"

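Checking the status code

Sometimes you need to check the status code itself rather than the response body.
For example, a network device running a health probe treats any response outside 2xx~3xx as a failure.
So when the process itself looks healthy, the usual next step is to check what it actually responds with.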

curl -w " - status code: %{http_code}" "http://url.com/"
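
The same check can be scripted. The following is a minimal Python sketch using only the standard library, under a few assumptions: the local test server, its two paths, and the helper names are made up for the example, and the 2xx~3xx rule mirrors the health-probe description above. Note that urllib follows redirects on its own, so a 3xx code is rarely what the caller sees.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class Handler(BaseHTTPRequestHandler):
    """Tiny local server: 200 for "/", 404 for anything else."""

    def do_GET(self):
        self.send_response(200 if self.path == "/" else 404)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the example output quiet


# Ignore any proxy environment variables so the local call goes direct.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def status_code(url, timeout=3.0):
    """Return the HTTP status code, or None on timeout or connection error."""
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx and 5xx arrive as exceptions
    except OSError:
        return None


def is_healthy(code):
    """Health-probe rule: only 2xx and 3xx count as success."""
    return code is not None and 200 <= code < 400


server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"
ok, missing = status_code(base + "/"), status_code(base + "/missing")
server.shutdown()
print(ok, is_healthy(ok), missing, is_healthy(missing))
```

The timeout argument plays the role of curl's -m option, and the returned code corresponds to %{http_code}.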

Specifying a bearer token

Calls to a service that requires authentication may need to pass a token value in a header.
In that case, specify it as follows.

TOKEN=xxx
$ curl 'https://url.com' -H "Authorization: Bearer $TOKEN"

Viewing the curl call process

Sometimes you want to see the detailed progress of a call, not just the response.
This is useful for TLS-related issues.

Use the -v option in that case.

$ curl -v https://url.com

Viewing the redirected page

When the page redirects, a plain curl call only shows the 301 response.
In that case, use the -L option as shown below to follow the redirect.

$ curl naver.com
<html>
<head><title>301 Moved Permanently</title></head>
<body>
<center><h1>301 Moved Permanently</h1></center>
<hr><center>nginx</center>
</body>
</html>

$ curl -L naver.com 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   162  100   162    0     0   4205      0 --:--:-- --:--:-- --:--:--  4263
100   138    0   138    0     0   2251      0 --:--:-- --:--:-- --:--:--  2251
   <!doctype html> <html lang="ko" class="fzoom"> <head> <meta charset="utf-8"> <meta name="Referrer" content="origin"> <meta http-equiv="X-UA-Compa
tible" content="IE=edge"> <meta name="viewport" content="width=1190"> <title>NAVER</title> 
...

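Downloading files with curl

Sometimes you need to download an actual file with curl. (wget also works.)
Use the -o or -O option in that case.

curl -o a.css http://url.com/default.css # save under the specified file name
curl -O http://url.com/default.css # save under the original file name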


[4] EKS Monitoring and Logging

한명 2025. 3. 1. 02:54


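This post looks at monitoring and logging in EKS.

First, it covers the difference between monitoring and observability and the terms for the signals they use: metrics, logs, and tracing. It then explains the monitoring perspectives of a Kubernetes environment and how EKS provides cluster monitoring. Finally, it compares this with monitoring in AKS (Azure Kubernetes Service).

This post mainly takes the perspective of monitoring at the level of the metrics, events, and logs provided by the CSP.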

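1. Monitoring and Observability

While the word monitoring has been in common use for a long time, the more recent term observability is still a little unfamiliar. This section tries to distinguish the two.

It also covers terms used in monitoring and observability, such as Metric, Log, and Tracing.

Monitoring and Observability

Monitoring is the activity of understanding and watching the overall state of a system. It is the process of collecting performance indicators against defined criteria so that unexpected problems are detected early.

Monitoring has long been widely used to measure IT systems and prevent failures, but as systems have become more diverse and moved to distributed environments, the monitoring indicators of individual components are no longer enough to understand the service as a whole.

That said, growing system complexity does not make monitoring itself unnecessary: drilling down into a problem still requires finding useful information in the monitoring indicators and logs of individual systems.

Observability aims to provide insight into each event behind the problems that can occur in applications in distributed environments such as cloud native, by combining tags and logs across microservices to provide context.

A problem in a microservice environment can be hard to explain by analyzing individual systems and may have to be traced through the connections between microservices. Approaching this through indicators and events remains important, but the role of observability in debugging such problems seems to be growing.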

Metric, Log, Tracing

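Metrics: quantifiable values about a particular target. They are data that express the performance and state of a system as numbers, for example CPU usage, memory usage, request latency, and error rate.

They give an at-a-glance view of the overall state of the system and help detect anomalies. Typically you watch the trend over time and detect values that cross a threshold or deviate from the usual pattern.

Logs: text or structured data recording the events that occurred in a system. Logs are usually recorded with fields such as a timestamp, a status code, and a detailed message, so they reveal the detailed reason behind an anomaly or error at the time it occurred, which makes them useful for debugging. Examples include application logs and system event logs.

Tracing: identifying which components a request passes through in a distributed system. It visualizes the flow of a request and helps diagnose performance issues, bottlenecks, or issues that span multiple services. For example, when a user opens a screen, tracing shows which internal services that request goes through.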
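
To make the three signal types concrete, here is a small Python sketch with entirely hypothetical data: a metric series checked against a threshold, a structured log record, and the spans of one traced request. The service names, values, and field names are assumptions for illustration only.

```python
import json

# Metric: numeric samples over time (hypothetical CPU utilization, percent).
cpu_percent = [41, 43, 40, 95, 42]
THRESHOLD = 90
breaches = [i for i, value in enumerate(cpu_percent) if value > THRESHOLD]

# Log: one timestamped, structured event record.
log_line = json.dumps(
    {"ts": "2025-03-01T00:00:03Z", "status": 500, "msg": "upstream timeout"}
)
record = json.loads(log_line)

# Trace: time spent in each service for a single request (self time, ms).
spans = [("productpage", 15), ("details", 8), ("reviews", 80), ("ratings", 12)]
slowest_service, slowest_ms = max(spans, key=lambda span: span[1])

print(breaches, record["status"], slowest_service)
```

The metric says that something went wrong at sample 3, the log says what happened at that moment, and the trace says where in the request path the time was spent.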

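2. Monitoring in a Kubernetes Environment

Observability calls for a different approach and different solutions than monitoring, so the rest of this post stays with monitoring from a Kubernetes perspective.

The aim is not so much to monitor Kubernetes itself as to describe monitoring at each layer for the stability of applications running in a Kubernetes environment.

The explanation is based on the figure below, provided by Azure.

Source: https://learn.microsoft.com/ko-kr/azure/aks/monitor-aks

Level 1 is monitoring at the level of the network where the cluster resides, and Level 2 is monitoring of the nodes (virtual machine sets), the cluster-level components.

Level 3 is monitoring of the Kubernetes components: elements such as the API server and Cloud Controller in the control plane, and the kubelet.

Level 4 is monitoring of Kubernetes objects and workloads, which should include, for example, container metrics and events such as pod restarts. Events raised by Kubernetes are also important.

Level 5 is application-level monitoring. It can overlap somewhat with Level 4, but it covers application logs as well as application metrics.

The figure also indicates that monitoring tools and visualization tools are needed for all of this.

From the perspective of a cloud managed Kubernetes service, CRUD operations on resources may also need to be monitored: for example, events such as someone creating a new resource, changing a configuration, or deleting something by mistake.

Each CSP (Cloud Service Provider) usually offers a general-purpose monitoring feature for its products, such as CloudWatch on AWS or Azure Monitor on Azure.

They also offer a logging service for storing log data and retrieving it with queries: CloudWatch Logs on AWS and Log Analytics Workspace on Azure.

The following sections look at how EKS provides monitoring, from the monitoring and logging perspectives.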

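3. Creating the Lab Environment

To look at the monitoring and logging that EKS provides, set up the following lab environment.

Note: This lab environment was provided as part of AEWS (AWS EKS Workshop Study), 3rd cohort.

Deploy the lab environment with CloudFormation as follows.

# Download the YAML file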
curl -O https://s3.ap-northeast-2.amazonaws.com/cloudformation.cloudneta.net/K8S/myeks-4week.yaml

# Set variables
CLUSTER_NAME=myeks
SSHKEYNAME=ekskey
MYACCESSKEY=<IAM user access key>
MYSECRETKEY=<IAM user secret key>
WorkerNodeInstanceType=t3.xlarge # the worker node instance type can be changed

# Deploy the CloudFormation stack
aws cloudformation deploy --template-file myeks-4week.yaml --stack-name $CLUSTER_NAME --parameter-overrides KeyName=$SSHKEYNAME SgIngressSshCidr=$(curl -s ipinfo.io/ip)/32  MyIamUserAccessKeyID=$MYACCESSKEY MyIamUserSecretAccessKey=$MYSECRETKEY ClusterBaseName=$CLUSTER_NAME WorkerNodeInstanceType=$WorkerNodeInstanceType --region ap-northeast-2

# Print the IP of the working EC2 instance once the CloudFormation stack is deployed
aws cloudformation describe-stacks --stack-name myeks --query 'Stacks[*].Outputs[0].OutputValue' --output text

# Connect to the EC2 instance
ssh -i ~/.ssh/<key>.pem ec2-user@$(aws cloudformation describe-stacks --stack-name myeks --query 'Stacks[*].Outputs[0].OutputValue' --output text)

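EKS itself is also deployed as part of this, through the operations server that the CloudFormation stack creates.

Wait about 20 minutes for the EKS deployment to complete, then check the installed EKS cluster.

# Set variables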
CLUSTER_NAME=myeks
SSHKEYNAME=ekskey

# Check the cluster installation
eksctl get cluster
NAME    REGION          EKSCTL CREATED
myeks   ap-northeast-2  True

eksctl get nodegroup --cluster $CLUSTER_NAME
CLUSTER NODEGROUP       STATUS  CREATED                 MIN SIZE        MAX SIZE        DESIRED CAPACITY        INSTANCE TYPE   IMAGE ID                ASG NAME                                   TYPE
myeks   ng1             ACTIVE  2025-02-28T15:05:57Z    3               3               3                       t3.xlarge       AL2023_x86_64_STANDARD  eks-ng1-b4caa68b-dac3-4a9c-a489-7d63a0d70934       managed

eksctl get addon --cluster $CLUSTER_NAME
2025-03-01 00:26:31 [ℹ]  Kubernetes version "1.31" in use by cluster "myeks"
2025-03-01 00:26:31 [ℹ]  getting all addons
2025-03-01 00:26:33 [ℹ]  to see issues for an addon run `eksctl get addon --name <addon-name> --cluster <cluster-name>`
NAME                    VERSION                 STATUS  ISSUES  IAMROLE                                                                                 UPDATE AVAILABLE   CONFIGURATION VALUES            POD IDENTITY ASSOCIATION ROLES
aws-ebs-csi-driver      v1.40.0-eksbuild.1      ACTIVE  0       arn:aws:iam::430118812536:role/eksctl-myeks-addon-aws-ebs-csi-driver-Role1-Ks7b8mzq4vmu
coredns                 v1.11.4-eksbuild.2      ACTIVE  0
kube-proxy              v1.31.3-eksbuild.2      ACTIVE  0
metrics-server          v0.7.2-eksbuild.2       ACTIVE  0
vpc-cni                 v1.19.3-eksbuild.1      ACTIVE  0       arn:aws:iam::430118812536:role/eksctl-myeks-addon-vpc-cni-Role1-He4lLHyBeE62              enableNetworkPolicy: "true"

eksctl get iamserviceaccount --cluster $CLUSTER_NAME

NAMESPACE       NAME                            ROLE ARN
kube-system     aws-load-balancer-controller    arn:aws:iam::430118812536:role/eksctl-myeks-addon-iamserviceaccount-kube-sys-Role1-1GFoeNZ7Z43o

# Create the kubeconfig
aws sts get-caller-identity --query Arn
aws eks update-kubeconfig --name myeks --user-alias <user from the identity printed above>

# Check the basic configuration
kubectl cluster-info
Kubernetes control plane is running at https://7984C504F1BE86380015EB205905A2C5.gr7.ap-northeast-2.eks.amazonaws.com
CoreDNS is running at https://7984C504F1BE86380015EB205905A2C5.gr7.ap-northeast-2.eks.amazonaws.com/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.

kubectl get node
NAME                                               STATUS   ROLES    AGE   VERSION
ip-192-168-1-115.ap-northeast-2.compute.internal   Ready    <none>   21m   v1.31.5-eks-5d632ec
ip-192-168-2-178.ap-northeast-2.compute.internal   Ready    <none>   21m   v1.31.5-eks-5d632ec
ip-192-168-3-168.ap-northeast-2.compute.internal   Ready    <none>   21m   v1.31.5-eks-5d632ec


kubectl get node --label-columns=node.kubernetes.io/instance-type,eks.amazonaws.com/capacityType,topology.kubernetes.io/zone
NAME                                               STATUS   ROLES    AGE   VERSION               INSTANCE-TYPE   CAPACITYTYPE   ZONE
ip-192-168-1-115.ap-northeast-2.compute.internal   Ready    <none>   21m   v1.31.5-eks-5d632ec   t3.xlarge       ON_DEMAND      ap-northeast-2a
ip-192-168-2-178.ap-northeast-2.compute.internal   Ready    <none>   21m   v1.31.5-eks-5d632ec   t3.xlarge       ON_DEMAND      ap-northeast-2b
ip-192-168-3-168.ap-northeast-2.compute.internal   Ready    <none>   21m   v1.31.5-eks-5d632ec   t3.xlarge       ON_DEMAND      ap-northeast-2c


kubectl get pod -A
NAMESPACE     NAME                                  READY   STATUS    RESTARTS   AGE
kube-system   aws-node-4jsvr                        2/2     Running   0          21m
kube-system   aws-node-bkp8n                        2/2     Running   0          21m
kube-system   aws-node-v5rhv                        2/2     Running   0          21m
kube-system   coredns-86f5954566-4j74x              1/1     Running   0          27m
kube-system   coredns-86f5954566-mcw5d              1/1     Running   0          27m
kube-system   ebs-csi-controller-549bf6879f-26wqx   6/6     Running   0          17m
kube-system   ebs-csi-controller-549bf6879f-qgqtz   6/6     Running   0          17m
kube-system   ebs-csi-node-8zr72                    3/3     Running   0          17m
kube-system   ebs-csi-node-sc6tt                    3/3     Running   0          17m
kube-system   ebs-csi-node-v48kr                    3/3     Running   0          17m
kube-system   kube-proxy-6wkjg                      1/1     Running   0          21m
kube-system   kube-proxy-v8228                      1/1     Running   0          21m
kube-system   kube-proxy-xw8hc                      1/1     Running   0          21m
kube-system   metrics-server-6bf5998d9c-2gngg       1/1     Running   0          27m
kube-system   metrics-server-6bf5998d9c-wv68w       1/1     Running   0          27m

Next, deploy some components that the rest of the lab uses.

[Note]
A domain was created in Route 53 before this lab. Without a domain, services have to be switched to the LoadBalancer type or similar, which may limit parts of the lab.
- Purchasing a domain through Route 53: https://www.youtube.com/watch?v=4HBFozkJUeU

A certificate for that domain must also be issued through AWS Certificate Manager.
- Issuing a certificate: https://www.youtube.com/watch?v=mMpPlaUj-vI

Let's continue.

# Environment variables
MyDomain=aperson.link # enter your own domain name
MyDnzHostedZoneId=$(aws route53 list-hosted-zones-by-name --dns-name "$MyDomain." --query "HostedZones[0].Id" --output text)
CERT_ARN=$(aws acm list-certificates --query 'CertificateSummaryList[].CertificateArn[]' --output text) # ARN of the certificate in the region in use

# kube-ops-view
helm repo add geek-cookbook https://geek-cookbook.github.io/charts/
helm install kube-ops-view geek-cookbook/kube-ops-view --version 1.2.2 --set service.main.type=ClusterIP  --set env.TZ="Asia/Seoul" --namespace kube-system

# Create the gp3 storage class
cat <<EOF | kubectl apply -f -
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: gp3
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
allowVolumeExpansion: true
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3
  allowAutoIOPSPerGBIncrease: 'true'
  encrypted: 'true'
  fsType: xfs # the default is ext4
EOF
kubectl get sc

# ExternalDNS
curl -s https://raw.githubusercontent.com/gasida/PKOS/main/aews/externaldns.yaml | MyDomain=$MyDomain MyDnzHostedZoneId=$MyDnzHostedZoneId envsubst | kubectl apply -f -

# AWS LoadBalancerController
helm repo add eks https://aws.github.io/eks-charts
helm install aws-load-balancer-controller eks/aws-load-balancer-controller -n kube-system --set clusterName=$CLUSTER_NAME \
  --set serviceAccount.create=false --set serviceAccount.name=aws-load-balancer-controller

# Ingress for kubeopsview: the group setting lets multiple Ingresses share a single ALB
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    alb.ingress.kubernetes.io/certificate-arn: $CERT_ARN
    alb.ingress.kubernetes.io/group.name: study
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}, {"HTTP":80}]'
    alb.ingress.kubernetes.io/load-balancer-name: $CLUSTER_NAME-ingress-alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/ssl-redirect: "443"
    alb.ingress.kubernetes.io/success-codes: 200-399
    alb.ingress.kubernetes.io/target-type: ip
  labels:
    app.kubernetes.io/name: kubeopsview
  name: kubeopsview
  namespace: kube-system
spec:
  ingressClassName: alb
  rules:
  - host: kubeopsview.$MyDomain
    http:
      paths:
      - backend:
          service:
            name: kube-ops-view
            port:
              number: 8080
        path: /
        pathType: Prefix
EOF

Also deploy a sample application so that monitoring data accumulates for the lab.

# Deploy the Bookinfo application
kubectl apply -f https://raw.githubusercontent.com/istio/istio/refs/heads/master/samples/bookinfo/platform/kube/bookinfo.yaml

# Verify
kubectl get all,sa

# Check access to the productpage web page
kubectl exec "$(kubectl get pod -l app=ratings -o jsonpath='{.items[0].metadata.name}')" -c ratings -- curl -sS productpage:9080/productpage | grep -o "<title>.*</title>"

# Logs
kubectl logs -l app=productpage -f


# Deploy the Ingress
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    alb.ingress.kubernetes.io/certificate-arn: $CERT_ARN
    alb.ingress.kubernetes.io/group.name: study
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}, {"HTTP":80}]'
    alb.ingress.kubernetes.io/load-balancer-name: $CLUSTER_NAME-ingress-alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/ssl-redirect: "443"
    alb.ingress.kubernetes.io/success-codes: 200-399
    alb.ingress.kubernetes.io/target-type: ip
  labels:
    app.kubernetes.io/name: bookinfo
  name: bookinfo
spec:
  ingressClassName: alb
  rules:
  - host: bookinfo.$MyDomain
    http:
      paths:
      - backend:
          service:
            name: productpage
            port:
              number: 9080
        path: /
        pathType: Prefix
EOF
kubectl get ingress

# Print the Bookinfo URL
echo -e "bookinfo URL = https://bookinfo.$MyDomain/productpage"
open "https://bookinfo.$MyDomain/productpage" # macOS

Check the deployed resources. ExternalDNS registers the DNS records, but propagation can take some time.

kubectl get ingress -n kube-system
NAME          CLASS   HOSTS                      ADDRESS                                                        PORTS   AGE
kubeopsview   alb     kubeopsview.aperson.link   myeks-ingress-alb-665851389.ap-northeast-2.elb.amazonaws.com   80      5m

kubectl get ingress
NAME       CLASS   HOSTS                   ADDRESS                                                        PORTS   AGE
bookinfo   alb     bookinfo.aperson.link   myeks-ingress-alb-665851389.ap-northeast-2.elb.amazonaws.com   80      7s

Confirm that the site is reachable.

The sample application is also running normally.

To generate logs, you can send repeated requests as follows.

curl -s -k https://bookinfo.$MyDomain/productpage | grep -o "<title>.*</title>"
while true; do curl -s -k https://bookinfo.$MyDomain/productpage | grep -o "<title>.*</title>" ; echo "--------------" ; sleep 1; done
for i in {1..100};  do curl -s -k https://bookinfo.$MyDomain/productpage | grep -o "<title>.*</title>" ; done

4. Monitoring and Logging in EKS

An overview of EKS monitoring and logging is available in the documentation below.

https://docs.aws.amazon.com/prescriptive-guidance/latest/implementing-logging-monitoring-cloudwatch/amazon-eks-logging-monitoring.html

EKS integrates with CloudWatch Logs so that you can inspect control plane logs. You can also deploy the CloudWatch agent to EKS nodes to collect node and container logs; in that setup, Fluent Bit or Fluentd collects the container logs and forwards them to CloudWatch Logs.

CloudWatch Container Insights provides monitoring at the cluster, node, pod, and service levels of an EKS cluster. It can also collect a wider range of metrics through Prometheus.

The EKS monitoring solutions are summarized in the figure below.

Source: https://www.youtube.com/watch?v=349ywnrrROg

This post uses CloudWatch Logs and CloudWatch Container Insights to look at EKS monitoring and logging.

EKS Logging

Control Plane Logging

Let's start with control plane logging.

Control plane logging offers the following log types: API server, Audit, Authenticator, Controller manager, and Scheduler. They are described in the document below.

https://docs.aws.amazon.com/eks/latest/userguide/control-plane-logs.html

Typical situations where control plane logging is needed include checking the audit log to trace an object created at a particular time, checking the API server log when the cluster misbehaves, and checking the controller manager log when AWS-level resources are involved.

In the web console, open the Observability tab and scroll down to find Control plane logs. For a cluster created with eksctl, every log type is off by default.

Let's enable control plane logging and look at the logs.

# Enable all log types
aws eks update-cluster-config --region ap-northeast-2 --name $CLUSTER_NAME \
    --logging '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}'
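
If you only need some of the log types, the JSON passed to --logging can be assembled from a list. The following is a minimal sketch under that assumption; build_logging_config is a hypothetical helper name, not part of the AWS CLI, and the aws calls are left as comments so that nothing runs against a real cluster.

```shell
# Hypothetical helper: build the --logging JSON for a chosen set of log types.
# $1 is "true" or "false"; the remaining arguments are log type names.
build_logging_config() {
  local enabled="$1"; shift
  local types="" t
  for t in "$@"; do
    # Append each type as a quoted, comma-separated JSON string.
    types="${types:+$types,}\"$t\""
  done
  printf '{"clusterLogging":[{"types":[%s],"enabled":%s}]}\n' "$types" "$enabled"
}

# Example (not executed here): enable only the audit and authenticator logs.
# aws eks update-cluster-config --region ap-northeast-2 --name "$CLUSTER_NAME" \
#   --logging "$(build_logging_config true audit authenticator)"
# Confirm the current setting:
# aws eks describe-cluster --name "$CLUSTER_NAME" \
#   --query 'cluster.logging.clusterLogging' --output json
```

Enabling fewer log types reduces the volume written to CloudWatch Logs, which matters because audit logs in particular tend to be large.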

The web console now shows that logging is enabled.

In CloudWatch, a new log group has been created.

Inside that log group, a log stream has been created for each log source.

Select one of the log streams to see the actual log entries.

You can also query the logs with CloudWatch Logs Insights.

# Search for log entries where an EC2 instance is in the NodeNotReady state
fields @timestamp, @message
| filter @message like /NodeNotReady/
| sort @timestamp desc

# Count requests per userAgent in the kube-apiserver-audit log and sort by count
fields userAgent, requestURI, @timestamp, @message
| filter @logStream ~= "kube-apiserver-audit"
| stats count(userAgent) as count by userAgent
| sort count desc

# kube-scheduler logs
fields @timestamp, @message
| filter @logStream ~= "kube-scheduler"
| sort @timestamp desc

# authenticator logs
fields @timestamp, @message
| filter @logStream ~= "authenticator"
| sort @timestamp desc

# kube-controller-manager logs
fields @timestamp, @message
| filter @logStream ~= "kube-controller-manager"
| sort @timestamp desc
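
The same queries can also be run from the command line. The following is a rough sketch, assuming a configured AWS CLI and the control plane log group created above; run_insights_query is a hypothetical helper name, and the default 60-minute window is an arbitrary choice.

```shell
# Sketch: run a Logs Insights query from the CLI and poll until it finishes.
# $1 = log group name, $2 = query string, $3 = look-back window in minutes.
run_insights_query() {
  local log_group="$1" query="$2" minutes="${3:-60}"
  local end_time start_time query_id status
  end_time=$(date +%s)
  start_time=$((end_time - minutes * 60))

  # start-query returns a queryId that is used to fetch the results later.
  query_id=$(aws logs start-query \
    --log-group-name "$log_group" \
    --start-time "$start_time" \
    --end-time "$end_time" \
    --query-string "$query" \
    --query 'queryId' --output text) || return 1

  # Poll until the query leaves the Scheduled/Running states.
  while :; do
    status=$(aws logs get-query-results --query-id "$query_id" \
      --query 'status' --output text) || return 1
    case "$status" in
      Scheduled|Running) sleep 1 ;;
      *) break ;;
    esac
  done

  aws logs get-query-results --query-id "$query_id" --output json
}

# Example (not executed here):
# run_insights_query "/aws/eks/$CLUSTER_NAME/cluster" \
#   'fields @timestamp, @message | filter @message like /NodeNotReady/ | sort @timestamp desc' 90
```

Note that --query-string carries the Logs Insights query, while --query is the CLI's own JMESPath filter on the response.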

This is an example of counting requests per userAgent from the kube-apiserver-audit log.

You can also read the logs with the aws logs command.

# List log groups
aws logs describe-log-groups | jq

# Tail the logs (see: aws logs tail help)
aws logs tail /aws/eks/$CLUSTER_NAME/cluster | more

# Stream new log entries as they arrive
aws logs tail /aws/eks/$CLUSTER_NAME/cluster --follow

# Filter pattern
aws logs tail /aws/eks/$CLUSTER_NAME/cluster --filter-pattern <filter pattern>

# Log stream name prefix
aws logs tail /aws/eks/$CLUSTER_NAME/cluster --log-stream-name-prefix <log stream prefix> --follow
aws logs tail /aws/eks/$CLUSTER_NAME/cluster --log-stream-name-prefix kube-apiserver --follow
aws logs tail /aws/eks/$CLUSTER_NAME/cluster --log-stream-name-prefix kube-apiserver-audit --follow
aws logs tail /aws/eks/$CLUSTER_NAME/cluster --log-stream-name-prefix kube-scheduler --follow
aws logs tail /aws/eks/$CLUSTER_NAME/cluster --log-stream-name-prefix authenticator --follow
aws logs tail /aws/eks/$CLUSTER_NAME/cluster --log-stream-name-prefix kube-controller-manager --follow
aws logs tail /aws/eks/$CLUSTER_NAME/cluster --log-stream-name-prefix cloud-controller-manager --follow
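
# Scale CoreDNS from another terminal to generate control plane activity to watch in the logs
kubectl scale deployment -n kube-system coredns --replicas=1
kubectl scale deployment -n kube-system coredns --replicas=2

# Time range: seconds (s), minutes (m), hours (h), days (d), weeks (w)
aws logs tail /aws/eks/$CLUSTER_NAME/cluster --since 1h30m

# Short output format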
aws logs tail /aws/eks/$CLUSTER_NAME/cluster --since 1h30m --format short

Being able to read CloudWatch Logs this easily with aws logs tail looked like a major advantage.

With the exercise finished, let's disable control plane logging.

# Disable EKS control plane logging (CloudWatch Logs)
eksctl utils update-cluster-logging --cluster $CLUSTER_NAME --region ap-northeast-2 --disable-types all --approve

# Delete the log group
aws logs delete-log-group --log-group-name /aws/eks/$CLUSTER_NAME/cluster

Node and Application Logging

The CloudWatch agent and Fluent Bit are used to monitor EKS nodes and containers. Both run as DaemonSets, arranged as shown below.

Source: https://aws.amazon.com/ko/blogs/containers/fluent-bit-integration-in-cloudwatch-container-insights-for-eks/

They are delivered as the CloudWatch Observability add-on, so install it as follows.

# Configure IRSA
eksctl create iamserviceaccount \
  --name cloudwatch-agent \
  --namespace amazon-cloudwatch --cluster $CLUSTER_NAME \
  --role-name $CLUSTER_NAME-cloudwatch-agent-role \
  --attach-policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy \
  --role-only \
  --approve

# Deploy the add-on (run on the EC2 instance where the environment variables are already defined)
aws eks create-addon --addon-name amazon-cloudwatch-observability --cluster-name $CLUSTER_NAME --service-account-role-arn arn:aws:iam::$ACCOUNT_ID:role/$CLUSTER_NAME-cloudwatch-agent-role

# Verify the add-on
aws eks list-addons --cluster-name myeks --output table
---------------------------------------
|             ListAddons              |
+-------------------------------------+
||              addons               ||
|+-----------------------------------+|
||  amazon-cloudwatch-observability  ||
||  aws-ebs-csi-driver               ||
||  coredns                          ||
||  kube-proxy                       ||
||  metrics-server                   ||
||  vpc-cni                          ||
|+-----------------------------------+|
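
The add-on takes a short while to become active after create-addon returns. A small sketch like the following can wait for it before checking the pods; wait_for_addon is a hypothetical helper name, and the sketch assumes the same AWS CLI setup as above.

```shell
# Sketch: block until the add-on is ACTIVE, then print its status and version.
# $1 = cluster name, $2 = add-on name.
wait_for_addon() {
  local cluster="$1" addon="$2"
  # "aws eks wait addon-active" polls DescribeAddon until the status is ACTIVE.
  aws eks wait addon-active --cluster-name "$cluster" --addon-name "$addon" || return 1
  aws eks describe-addon --cluster-name "$cluster" --addon-name "$addon" \
    --query 'addon.[status,addonVersion]' --output text
}

# Example (not executed here):
# wait_for_addon "$CLUSTER_NAME" amazon-cloudwatch-observability
```

If $ACCOUNT_ID is not already defined in your shell, it can be set with aws sts get-caller-identity --query Account --output text.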

# Verify the installation
kubectl get crd | grep -i cloudwatch

amazoncloudwatchagents.cloudwatch.aws.amazon.com   2025-02-28T16:27:24Z
dcgmexporters.cloudwatch.aws.amazon.com            2025-02-28T16:27:24Z
instrumentations.cloudwatch.aws.amazon.com         2025-02-28T16:27:25Z
neuronmonitors.cloudwatch.aws.amazon.com           2025-02-28T16:27:25Z

kubectl get all -n amazon-cloudwatch

NAME                                                                  READY   STATUS    RESTARTS   AGE
pod/amazon-cloudwatch-observability-controller-manager-6f76854w9rvx   1/1     Running   0          69s
pod/cloudwatch-agent-dcfqq                                            1/1     Running   0          64s
pod/cloudwatch-agent-jcvk5                                            1/1     Running   0          65s
pod/cloudwatch-agent-r8tcw                                            1/1     Running   0          64s
pod/fluent-bit-6zbmk                                                  1/1     Running   0          69s
pod/fluent-bit-j9hl8                                                  1/1     Running   0          69s
pod/fluent-bit-zrw4v                                                  1/1     Running   0          69s

..


# Check the cloudwatch-agent configuration
kubectl describe cm cloudwatch-agent -n amazon-cloudwatch
kubectl get cm cloudwatch-agent -n amazon-cloudwatch -o jsonpath="{.data.cwagentconfig\.json}" | jq
{
  "agent": {
    "region": "ap-northeast-2"
  },
  "logs": {
    "metrics_collected": {
      "application_signals": {
        "hosted_in": "myeks"
      },
      "kubernetes": {
        "cluster_name": "myeks",
        "enhanced_container_insights": true
      }
    }
  },
  "traces": {
    "traces_collected": {
      "application_signals": {}
    }
  }
}

# How logs are collected: HostPath volumes give the DaemonSet pods access to node and container logs
kubectl describe -n amazon-cloudwatch ds cloudwatch-agent
...
  Volumes:
   ...
   rootfs:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  


# Check the Fluent Bit INPUT/FILTER/OUTPUT settings
## Config sections: application-log.conf, dataplane-log.conf, fluent-bit.conf, host-log.conf, parsers.conf
kubectl describe cm fluent-bit-config -n amazon-cloudwatch
...
application-log.conf:
----
[INPUT]
    Name                tail
    Tag                 application.*
    Exclude_Path        /var/log/containers/cloudwatch-agent*, /var/log/containers/fluent-bit*, /var/log/containers/aws-node*, /var/log/containers/kube-proxy*
    Path                /var/log/containers/*.log
    multiline.parser    docker, cri
    DB                  /var/fluent-bit/state/flb_container.db
    Mem_Buf_Limit       50MB
    Skip_Long_Lines     On
    Refresh_Interval    10
    Rotate_Wait         30
    storage.type        filesystem
    Read_from_Head      ${READ_FROM_HEAD}
...

[FILTER]
    Name                kubernetes
    Match               application.*
    Kube_URL            https://kubernetes.default.svc:443
    Kube_Tag_Prefix     application.var.log.containers.
    Merge_Log           On
    Merge_Log_Key       log_processed
    K8S-Logging.Parser  On
    K8S-Logging.Exclude Off
    Labels              Off
    Annotations         Off
    Use_Kubelet         On
    Kubelet_Port        10250
    Buffer_Size         0

[OUTPUT]
    Name                cloudwatch_logs
    Match               application.*
    region              ${AWS_REGION}
    log_group_name      /aws/containerinsights/${CLUSTER_NAME}/application
    log_stream_prefix   ${HOST_NAME}-
    auto_create_group   true
    extra_user_agent    container-insights
...

The log groups created by the add-on and the logs they hold are as follows.

Source: https://docs.aws.amazon.com/prescriptive-guidance/latest/implementing-logging-monitoring-cloudwatch/kubernetes-eks-logging.html#eks-node-application-logging

CloudWatch Logs shows that the following log groups were created.

One puzzling point: the ConfigMap clearly defines a log_group_name ending in /host, yet that log group was not created, while an additional /performance log group exists.

kubectl describe cm fluent-bit-config -n amazon-cloudwatch |grep log_group_name
  log_group_name      /aws/containerinsights/${CLUSTER_NAME}/application
  log_group_name      /aws/containerinsights/${CLUSTER_NAME}/dataplane
  log_group_name      /aws/containerinsights/${CLUSTER_NAME}/host

The Fluent Bit logs also suggest that creating a log group failed.

kubectl logs -f -n amazon-cloudwatch fluent-bit-zrw4v
AWS for Fluent Bit Container Image Version 2.32.5
Fluent Bit v1.9.10
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2025/02/28 16:27:34] [error] [filter:kubernetes:kubernetes.1] [kubernetes] no upstream connections available to cloudwatch-agent.amazon-cloudwatch:4311
[2025/02/28 16:27:39] [error] [output:cloudwatch_logs:cloudwatch_logs.0] CreateLogGroup API responded with error='OperationAbortedException', message='A conflicting operation is currently in progress against this resource. Please try again.'
[2025/02/28 16:27:39] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to create log group
[2025/02/28 16:27:39] [error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to send events

This looks like a bug, but it is hard to verify because this area is managed by the add-on. For reference, the CreateLogGroup API documentation describes the error as follows.

https://docs.aws.amazon.com/AmazonCloudWatchLogs/latest/APIReference/API_CreateLogGroup.html

OperationAbortedException
Multiple concurrent requests to update the same resource were in conflict.
HTTP Status Code: 400

The /performance log group appears to be where the CloudWatch agent stores performance data for Container Insights.

kubectl logs -f -n amazon-cloudwatch   cloudwatch-agent-dcfqq |grep log_group_name
        log_group_name: /aws/application-signals/data
        log_group_name: /aws/containerinsights/{ClusterName}/performance
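
To see at a glance which of the expected log groups are missing, a check along these lines may help. It is only a sketch: missing_insights_groups is a hypothetical helper name, and the four suffixes are the ones that appear in the ConfigMap and agent output above.

```shell
# Sketch: print the expected Container Insights log groups that do not exist.
# $1 = cluster name.
missing_insights_groups() {
  local cluster="$1" prefix existing suffix
  prefix="/aws/containerinsights/${cluster}"
  existing=$(aws logs describe-log-groups \
    --log-group-name-prefix "$prefix" \
    --query 'logGroups[].logGroupName' --output text) || return 1

  for suffix in application dataplane host performance; do
    # Normalize tabs/newlines to spaces so each name can be matched as a whole word.
    case " $(printf '%s' "$existing" | tr '\t\n' '  ') " in
      *" ${prefix}/${suffix} "*) ;;        # group exists
      *) echo "${prefix}/${suffix}" ;;     # group is missing
    esac
  done
}

# Example (not executed here):
# missing_insights_groups "$CLUSTER_NAME"
```

In the situation described above, this would be expected to print only the /host log group.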

EKS Metric-Based Monitoring

The CloudWatch Observability add-on installed earlier also collects the metrics used by Container Insights.

Let's look at this in the web console.

Go to CloudWatch → Insights → Container Insights.

Click View performance dashboard at the top right to explore metrics and graphs through several views.

When you select a specific resource (chosen by namespace and workload), its metrics are displayed.

Overall, the information in Container Insights is systematically organized and each item is well visualized, which was impressive. The response time of each view was also fast.

To wrap up the exercise, delete the add-on and the log groups it created.

aws eks delete-addon --cluster-name $CLUSTER_NAME --addon-name amazon-cloudwatch-observability

aws logs delete-log-group --log-group-name /aws/containerinsights/$CLUSTER_NAME/application
aws logs delete-log-group --log-group-name /aws/containerinsights/$CLUSTER_NAME/dataplane
aws logs delete-log-group --log-group-name /aws/containerinsights/$CLUSTER_NAME/host
aws logs delete-log-group --log-group-name /aws/containerinsights/$CLUSTER_NAME/performance

Checking EKS Resource Events

EKS is integrated with AWS CloudTrail, a service that records activity on resources. CloudTrail records every API request made to EKS as an event, including Amazon EKS API operations made from the web console or from code.

If you create a trail, these logs can be kept long term in an Amazon S3 bucket. Even without any extra setup, the CloudTrail Event history shows the activity on your resources for the past 90 days.

These events are typically checked when you need to determine whether a change to a resource was made by EKS or by a user.

They can be viewed as shown below.

Selecting an entry shows details such as userIdentity, sourceIPAddress, and userAgent.

It is notable that AWS CloudTrail records not only changes to resources but also read requests.

For more on CloudTrail with EKS, see the following document.

https://docs.aws.amazon.com/eks/latest/userguide/logging-using-cloudtrail.html
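
The Event history can also be read from the CLI. The sketch below filters on the EKS event source; recent_eks_events is a hypothetical helper name, and the three columns selected are just one possible choice.

```shell
# Sketch: list recent EKS API events from the CloudTrail Event history.
# $1 = maximum number of events to show (defaults to 10).
recent_eks_events() {
  local max="${1:-10}"
  aws cloudtrail lookup-events \
    --lookup-attributes AttributeKey=EventSource,AttributeValue=eks.amazonaws.com \
    --max-items "$max" \
    --query 'Events[].{Time:EventTime,Name:EventName,User:Username}' \
    --output table
}

# Example (not executed here):
# recent_eks_events 20
```

The full record, including sourceIPAddress and userAgent, is available in the CloudTrailEvent field of each event if you drop the --query filter.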

That concludes EKS monitoring and logging. For richer metrics and visualization, AWS also offers Amazon Managed Service for Prometheus and Amazon Managed Grafana.

5. Monitoring and Logging in AKS

Azure's monitoring and logging solutions are Azure Monitor and Log Analytics workspaces.

They correspond to CloudWatch and CloudWatch Logs in AWS, respectively. Azure Monitor lets you view data through built-in views or by creating new blades, and because a Log Analytics workspace stores data in tables, you can query it with KQL (Kusto Query Language).

Azure also provides Container Insights to deliver metrics and logging specialized for AKS environments.

The overall AKS monitoring options are described in the following document.

https://learn.microsoft.com/en-us/azure/aks/monitor-aks?tabs=cilium

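Following the same order as for EKS, let's look at logging and then metrics.

AKS Logging

Control plane logs are described under Resource logs in the table in the document above. In Azure, each service supports diagnostic settings, and in AKS you use them to selectively collect control plane logs.

The categories available in the diagnostic settings are shown below. Unlike EKS, components such as the Cluster Autoscaler (CA) and the CSI controllers run in the control plane, so their logs can also be selected in the diagnostic settings.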
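
Diagnostic settings can also be created from the Azure CLI. The following is a sketch only: build_diag_logs is a hypothetical helper name, AKS_ID and WORKSPACE_ID are assumed variables holding the resource IDs of the cluster and the Log Analytics workspace, and the category names should be checked against what the portal lists for your cluster.

```shell
# Sketch: build the --logs JSON for an AKS diagnostic setting.
# Each argument is a log category name to enable.
build_diag_logs() {
  local items="" c
  for c in "$@"; do
    items="${items:+$items,}{\"category\":\"$c\",\"enabled\":true}"
  done
  printf '[%s]\n' "$items"
}

# Example (not executed here):
# az monitor diagnostic-settings create --name aks-control-plane \
#   --resource "$AKS_ID" --workspace "$WORKSPACE_ID" \
#   --logs "$(build_diag_logs kube-audit-admin cluster-autoscaler)"
```

Selecting only the categories you need keeps the volume ingested into the workspace down, in the same way as choosing log types on EKS.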

Next, you can configure Container Insights to monitor nodes and applications. Container Insights logs are stored in a Log Analytics workspace, and for cost reasons you can choose one of the predefined collection presets shown below.

Clicking Edit collection settings afterwards shows which log and metric types are collected.

You can select the items to collect, including performance metrics, container logs, the state of each object, and Kubernetes events.

These items are stored as separate tables in the Log Analytics workspace. You can query them from Monitoring > Logs on the cluster, or by going directly to the Log Analytics workspace.

See the sample queries below.

https://docs.azure.cn/en-us/azure-monitor/reference/queries/containerlog
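
A KQL query can also be sent from the command line. This sketch assumes that Kubernetes event collection is enabled so that the KubeEvents table exists, and that the Azure CLI log-analytics command is available; kube_event_reasons is a hypothetical helper name.

```shell
# Sketch: count recent Kubernetes events by reason from a Log Analytics workspace.
# $1 = workspace (customer) ID, $2 = look-back window in hours (defaults to 1).
kube_event_reasons() {
  local workspace="$1" hours="${2:-1}"
  local kql="KubeEvents | where TimeGenerated > ago(${hours}h) | summarize Count = count() by Reason | order by Count desc"
  az monitor log-analytics query \
    --workspace "$workspace" \
    --analytics-query "$kql" \
    --output table
}

# Example (not executed here):
# kube_event_reasons "$WORKSPACE_CUSTOMER_ID" 6
```

The same query text can be pasted into Monitoring > Logs in the portal.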

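Container Insights monitoring also offers Performance and Metrics views, but the direction recently seems to be shifting toward Prometheus metrics.

AKS Metric-Based Monitoring

Next, let's look at metrics.

Azure provides platform metrics, a default set of metrics per resource, free of charge. Even without enabling advanced monitoring features, some values are visible under AKS > Monitoring > Metrics.

For example, you can select node status, pod status, and node resource metrics. AKS has also recently started offering control plane metrics in preview.

The full list of platform metrics is described below.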

https://learn.microsoft.com/en-us/azure/aks/monitor-aks-reference#metrics

With Prometheus metrics and logging enabled in Container Insights, the AKS monitoring experience has recently improved significantly; it is currently in preview.

https://techcommunity.microsoft.com/blog/azureobservabilityblog/public-preview-the-new-aks-monitoring-experience/4297181

Checking AKS Resource Events

Finally, events for the resource can be checked in the Activity Log.

As with EKS, AKS can integrate monitoring through Managed Prometheus and Managed Grafana.

Note that Container Insights, which used to be reached through AKS > Monitoring > Insights, has become Monitor in AKS. The screen below is Monitor Settings, where you can select Managed Prometheus and Managed Grafana in addition to the Container Logs settings.

Compared with AKS, however, CloudWatch Container Insights on EKS seems to offer a somewhat more polished layout and visualization.

6. Resource Cleanup

Clean up the lab environment as follows.

nohup sh -c "eksctl delete cluster --name $CLUSTER_NAME && aws cloudformation delete-stack --stack-name $CLUSTER_NAME" > /root/delete.log 2>&1 &

CloudWatch Logs is known to be costly, so make sure every log group has been deleted.

# Delete log group: control plane
aws logs delete-log-group --log-group-name /aws/eks/$CLUSTER_NAME/cluster

# Delete log groups: data plane
aws logs delete-log-group --log-group-name /aws/containerinsights/$CLUSTER_NAME/application
aws logs delete-log-group --log-group-name /aws/containerinsights/$CLUSTER_NAME/dataplane
aws logs delete-log-group --log-group-name /aws/containerinsights/$CLUSTER_NAME/host
aws logs delete-log-group --log-group-name /aws/containerinsights/$CLUSTER_NAME/performance
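
Rather than deleting each log group by name, you could list whatever remains under the lab's prefixes and delete it in a loop. This is a sketch under the assumption that only this lab created log groups under those prefixes; delete_lab_log_groups is a hypothetical helper name.

```shell
# Sketch: delete every log group under the control plane and Container Insights
# prefixes for one cluster. $1 = cluster name.
delete_lab_log_groups() {
  local cluster="$1" prefix group
  for prefix in "/aws/eks/${cluster}/" "/aws/containerinsights/${cluster}/"; do
    # Log group names cannot contain whitespace, so word splitting is safe here.
    for group in $(aws logs describe-log-groups \
        --log-group-name-prefix "$prefix" \
        --query 'logGroups[].logGroupName' --output text); do
      [ "$group" = "None" ] && continue
      echo "deleting ${group}"
      aws logs delete-log-group --log-group-name "$group"
    done
  done
}

# Example (not executed here):
# delete_lab_log_groups "$CLUSTER_NAME"
```

Running it a second time should print nothing, which doubles as a check that no log groups are left.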

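Finally, check EC2 for any leftover volumes (such as PVs for Prometheus) and delete them.

Wrap-up

In this post we looked at monitoring in Kubernetes environments and at how to enable it and inspect the resulting data in EKS. We also looked at the level of monitoring AKS provides and compared the two.

This walkthrough focused mainly on the options provided by the CSPs. Some users instead build their own monitoring with open-source tools such as Prometheus and Grafana, and others use SaaS products that specialize in monitoring. Whether to use the CSP's monitoring solution, open source, or another form of monitoring is up to the user.

From what we have seen, the CSPs are evolving their monitoring toward observability. However, it is unclear whether CSP monitoring solutions are cost efficient, and customization of features such as blades and alerts is limited. In a multi-cloud environment, you also have to manage a different monitoring stack for each provider.

Open-source monitoring solutions address some of these points. Their downsides are the duplication and cost of deploying Prometheus stack components in every cluster, and the extra work of managing the monitoring solution itself. Their advantages are that visualizations and monitored items can be customized, and that the same monitoring stack can be deployed across different environments for a uniform setup.

Using a SaaS product that specializes in monitoring is another option. These solutions are generally excellent, but data transfer costs and security concerns may arise, and the cost of the solution itself can be a burden.

Every monitoring solution has pros and cons, so the choice calls for the user's own technical judgment.

That concludes this post.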


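KCNA and KCSA Exam Review

한명 2025. 2. 25. 23:36

While renewing my expired Kubernetes certifications recently, I noticed the newer KCNA and KCSA certifications and took them as well.

This post briefly introduces the two certifications and shares a short review of the exams.

Common

Both exams are taken through PSI, and compared with other exams the pre-exam check through the laptop camera is extremely thorough.

Dual monitors are not supported, and the proctor checks for books, paper, writing tools, and anything else in view.

Keeping only a laptop and a mouse on the desk is the way to avoid wasting time.

KCNA and KCSA are each a 90-minute exam with 60 four-option multiple-choice questions, and the certification is valid for two years.

Unlike cloud certifications, the scenarios are not complicated. Questions and answer choices are generally concise.

For example, you are asked to pick the component or product that plays a given role, or to choose the issue that can occur in a particular situation.

The exams are not very demanding, but they are offered only in English, so you may feel a little tired afterwards.

Both exams also include questions on the basics: the Kubernetes control plane components and their roles, and when each resource is needed.

KCNA (Kubernetes and Cloud Native Associate)

KCNA is the Kubernetes and Cloud Native Associate certification. An associate-level certification is not an expert-level one, so anyone who uses Kubernetes at work and has a basic understanding can register and take it without much worry.

Exam Scope (Domains)

Source: https://training.linuxfoundation.org/certification/kubernetes-cloud-native-associate/

The Linux Foundation link above lists the detailed topics. Skip what you already know and study only the parts you are unsure about. Because the scope is Kubernetes plus cloud native, related terminology and development processes are also covered.

That said, it really is not a difficult exam.

Here are some reference materials.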

https://github.com/moabukar/Kubernetes-and-Cloud-Native-Associate-KCNA/blob/main/docs/kcna/questions.md

https://www.itexams.com/exam/KCNA

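(The second link lets you view roughly the first 20 questions for free.)

KCSA (Kubernetes and Cloud Native Security Associate)

KCSA is the security associate certification for Kubernetes and cloud native. Likewise, it may not be much of a burden if you are familiar with Kubernetes environments, but this exam requires additional study of security topics.

Exam Scope (Domains)

Source: https://training.linuxfoundation.org/certification/kubernetes-and-cloud-native-security-associate-kcsa/

This exam also has questions on the Kubernetes control plane, and for roughly 10 or more questions you can simply pick the answer that intuitively sounds best from a security standpoint.

However, about 20 to 30 questions really require understanding of the topic, so studying for about 2 to 3 hours is a good idea.

You need to know Kubernetes security topics such as service accounts, network policies, RBAC, authorization, and admission controllers, and understand the role of components and products that harden Kubernetes environments (seccomp, AppArmor, gVisor, Falco, Firecracker, and so on). Finally, you should be familiar with security regulations and terminology.

Expand on the exam scope above and look up any unfamiliar terms in each category before the exam.

Unfortunately, well-organized study material for KCSA is hard to find.

Personally, I read through others' reviews of KCSA and studied mainly from the reference links.

I hope this was helpful. That's all for this post.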
