Rancher troubleshooting notes

  • ~$ kubectl get pod -n cattle-system
    NAME                                    READY   STATUS             RESTARTS   AGE
    cattle-cluster-agent-6bf6f8fcc4-sznpp   1/1     Running            0          18m
    cattle-node-agent-79nrh                 1/1     Running            23         67d
    cattle-node-agent-ch6pn                 1/1     Running            23         67d
    cattle-node-agent-jr5bq                 1/1     Running            7          7d20h
    cattle-node-agent-k2fcs                 1/1     Running            26         67d
    rancher-98d8d5cf5-hbjjv                 1/1     Running            1          25m
    rancher-98d8d5cf5-nhlwz                 0/1     CrashLoopBackOff   8          25m
    rancher-98d8d5cf5-zjbzs                 0/1     Running            0          105s

  1. Find out which rancher pod is the leader

    $ kubectl describe configMap cattle-controllers -n kube-system
    Name:         cattle-controllers
    Namespace:    kube-system
    Labels:       <none>
    Annotations:  control-plane.alpha.kubernetes.io/leader:
                    {"holderIdentity":"rancher-98d8d5cf5-hbjjv","leaseDurationSeconds":45,"acquireTime":"2021-09-08T06:40:25Z","renewTime":"2021-09-08T07:02:5...
    
    Data
    ====
    Events:  <none>
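
To avoid reading the leader out of the `describe` output by eye, the annotation can be parsed directly. The following is a minimal sketch; the helper name `leader_from_annotation` is made up for this note, and it assumes the annotation value is the single-line JSON shown above.

```shell
# Hypothetical helper: read the leader-election annotation JSON on stdin
# and print the value of "holderIdentity".
leader_from_annotation() {
  sed -n 's/.*"holderIdentity":"\([^"]*\)".*/\1/p'
}

# Possible usage against a live cluster (dots inside the annotation key
# are escaped for kubectl's jsonpath):
#   kubectl get configmap cattle-controllers -n kube-system \
#     -o jsonpath='{.metadata.annotations.control-plane\.alpha\.kubernetes\.io/leader}' \
#     | leader_from_annotation
```

With the annotation shown above, this should print the name of the leader pod, which can then be passed to `kubectl logs`.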

  2. The current leader is rancher-98d8d5cf5-hbjjv, so check that pod's logs

    $ kubectl logs rancher-98d8d5cf5-hbjjv -n cattle-system
    2021/09/08 06:38:27 [INFO] Rancher version v2.4.15 (cdb64d640) is starting
    2021/09/08 06:38:27 [INFO] Rancher arguments {ACMEDomains:[] AddLocal:auto Embedded:false HTTPListenPort:80 HTTPSListenPort:443 K8sMode:auto Debug:false Trace:false NoCACerts:false AuditLogPath:/var/log/auditlog/rancher-api-audit.log AuditLogMaxage:10 AuditLogMaxsize:100 AuditLogMaxbackup:10 AuditLevel:0 Features:}
    2021/09/08 06:38:27 [INFO] Listening on /tmp/log.sock
    I0908 06:38:27.719747       6 http.go:122] HTTP2 has been explicitly disabled
    :
    2021/09/08 06:56:18 [ERROR] AppController p-gn54t/test-20210831-master-sq [helm-controller] failed with : Get "https://10.43.0.1:443/apis/project.cattle.io/v3/namespaces/p-gn54t/apprevisions?labelSelector=io.cattle.field%!F(MISSING)appId%!D(MISSING)test-20210831-master-sq&timeout=30s": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
    2021/09/08 06:57:04 [ERROR] PipelineExecutionController p-gn54t/p-qp9qq-1 [pipeline-execution-controller] failed with : pipeline.project.cattle.io "p-gn54t/p-qp9qq" not found
    2021/09/08 07:01:20 [ERROR] PipelineExecutionController p-gn54t/p-qp9qq-1 [pipeline-execution-controller] failed with : pipeline.project.cattle.io "p-gn54t/p-qp9qq" not found
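
The leader's log can be long, so it may help to summarize which controllers are reporting errors. The sketch below is only an illustration: `count_errors_by_controller` is a made-up helper name, and it assumes the line layout seen above (date, time, `[ERROR]`, then the controller name as the fourth field), which not every Rancher log line necessarily follows.

```shell
# Hypothetical helper: read Rancher log lines on stdin and print
# "<count> <controller>" for every [ERROR] line, most frequent first.
count_errors_by_controller() {
  awk '$3 == "[ERROR]" { count[$4]++ }
       END { for (name in count) print count[name], name }' | sort -rn
}

# Possible usage against the leader pod found in step 1:
#   kubectl logs rancher-98d8d5cf5-hbjjv -n cattle-system | count_errors_by_controller
```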

  • If the jenkins pod shown below goes missing, the pipeline cannot start

    ~$ kubectl get namespace | grep pipeline
    cattle-pipeline               Active   66d
    p-gn54t-pipeline              Active   66d
    ~$ kubectl get pod -n p-gn54t-pipeline
    NAME                               READY   STATUS    RESTARTS   AGE
    docker-registry-57fbddc6cc-drt29   1/1     Running   4          66d
    jenkins-75cf8d9966-m2vc8           1/1     Running   0          168m
    minio-7b7866c65f-7hpl5             1/1     Running   0          167m

  • Simply delete the pipeline namespace (e.g. p-gn54t-pipeline); it is recreated automatically
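
The namespace deletion described above could be wrapped in a small guard so that only a project pipeline namespace is ever deleted. This is a sketch under assumptions: the function name `delete_pipeline_namespace`, the `p-<id>-pipeline` naming pattern (inferred from the listing above), and the `KUBECTL` override variable are all illustrative and not part of Rancher.

```shell
# Hypothetical guard: delete a namespace only if its name looks like a
# project pipeline namespace (p-<id>-pipeline). KUBECTL can be overridden,
# e.g. KUBECTL=echo, to preview the command instead of running it.
delete_pipeline_namespace() {
  ns="$1"
  case "$ns" in
    p-*-pipeline) ;;
    *)
      echo "refusing to delete '$ns': not a p-<id>-pipeline namespace" >&2
      return 1
      ;;
  esac
  "${KUBECTL:-kubectl}" delete namespace "$ns"
}

# Possible usage:
#   delete_pipeline_namespace p-gn54t-pipeline
#   kubectl get pod -n p-gn54t-pipeline   # check that jenkins is back
```
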
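  • Environment: Rancher installed with rke / helm
  • After running helm uninstall and then helm install again, Rancher still failed to start properly
  • After consulting one article on cleanly removing Rancher and another on the CRDs in Rancher, the following steps resolved the problem
    1. Delete the CRD dynamicschemas.management.cattle.io
    2. Delete the cert-manager and cattle-system namespaces
    3. Reinstall rancher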
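
The three steps above can be written out as commands. Because they are destructive, the sketch below only prints them as a plan rather than running them; `print_rancher_cleanup_plan` is a made-up name, and the reinstall step is left as a comment because the original helm install arguments are not recorded in these notes.

```shell
# Hypothetical dry-run helper: print the cleanup commands for review
# instead of executing them.
print_rancher_cleanup_plan() {
  echo "kubectl delete crd dynamicschemas.management.cattle.io"
  echo "kubectl delete namespace cert-manager"
  echo "kubectl delete namespace cattle-system"
  echo "# then reinstall rancher with the helm install command used originally"
}

print_rancher_cleanup_plan
```

Review the printed commands and run them by hand one at a time; piping the output straight into a shell is not recommended here.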
  • tech/rancher_tips.txt
  • Last modified: 2021/09/10 08:35
  • jonathan