Troubleshooting: unable to add a Cloudpods compute node

Adding a compute node to Cloudpods failed with:

TASK [worker-node : Use 'ocadm join 192.168.x.x:6443 --token howsjs.XXX --discovery-token-unsafe-skip-ca-verification --enable-host-agent --node-ip 192.168.x.x --enable-hugepage '] ***
fatal: [192.168.x.x]: FAILED! =>
 {"changed": true, "cmd": ["/opt/yunion/bin/ocadm", "join", "192.168.x.x:6443", "--token", "howsjs.XXXXXXX", "--discovery-token-unsafe-skip-ca-verification", "--enable-host-agent", "--node-ip", "192.168.8.6", "--enable-hugepage"], "delta": "0:05:00.667016", "end": "2023-07-21 15:08:32.106642", "msg": "non-zero return code", "rc": 1, "start": "2023-07-21 15:03:31.439626", "stderr": "\t[WARNING SystemVerification]: this Docker version is not on the list of validated versions: 20.10.5. Latest validated version: 18.09\nerror execution phase preflight: couldn't validate the identity of the API Server: abort connecting to API servers after timeout of 5m0s", "stderr_lines": ["\t[WARNING SystemVerification]: this Docker version is not on the list of validated versions: 20.10.5. Latest validated version: 18.09", "error execution phase preflight: couldn't validate the identity of the API Server: abort connecting to API servers after timeout of 5m0s"], "stdout": "[preflight] Running pre-flight checks", "stdout_lines": ["[preflight] Running pre-flight checks"]}

The Docker version warning is unrelated, but the actual error is baffling, because the API Server is running normally:

error execution phase preflight: couldn't validate the identity of the API Server: abort connecting to API servers after timeout of 5m0s

In ocboot, find the corresponding role and raise the log verbosity of the join command:

-command: "/opt/yunion/bin/ocadm {{ join_args }}"
+command: "/opt/yunion/bin/ocadm {{ join_args }} -v 5"

Run add-node again, and the output is a bit more detailed:

.... [discovery] Failed to connect to API Server \"192.168.x.x:6443\": token id \"qqrnpr\" is invalid for this cluster or it has expired. Use \"kubeadm token create\" on the control-plane node to create a new valid token", "I0721 17:02:24.772935   37384 token.go:199] [discovery] Trying to connect to API Server \"192.168.x.x:6443\"", "I0721 17:02:24.773719   37384 token.go:74] [discovery] Created cluster-info discovery client, requesting info from \"https://192.168.8.8:6443\"", "I0721 17:02:24.776478   37384 token.go:202] [discovery] Failed to connect to API Server \"192.168.x.x:6443\": token id \"xxxx\" is invalid for this cluster or it has expired. Use \"kubeadm token create\" on the control-plane node to create a new valid token", "I0721 17:02:29.502523   37384 token.go:219] [discovery] abort connecting to API servers after timeout of 5m0s", "error execution phase preflight: couldn't validate the identity of the API Server: abort connecting to API servers after timeout of 5m0s"], "stdout": "[preflight] Running pre-flight checks", "stdout_lines": ["[preflight] Running pre-flight checks
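With -v 5 the same discovery failure repeats until the five-minute timeout, so the interesting message is easy to lose. Below is a minimal sketch, not part of ocboot or ocadm, that reduces such output to the distinct "[discovery] Failed" messages. The function name discovery_failures and the sample lines are made up for illustration, and the prefix pattern is only assumed to match the klog-style lines shown above.

```python
import re

# klog-style prefix as seen above, e.g. "I0721 17:02:24.776478   37384 token.go:202] "
KLOG_PREFIX = re.compile(r"^[IWEF]\d{4} \d{2}:\d{2}:\d{2}\.\d+\s+\d+ \S+:\d+\] ")


def discovery_failures(stderr_lines):
    """Return the distinct '[discovery] Failed ...' messages in first-seen order."""
    seen = []
    for line in stderr_lines:
        # Drop the timestamp/pid/source prefix so repeated attempts compare equal.
        message = KLOG_PREFIX.sub("", line.strip())
        if message.startswith("[discovery] Failed") and message not in seen:
            seen.append(message)
    return seen


# Hypothetical sample shaped like the stderr_lines in the Ansible output above.
sample_lines = [
    'I0721 17:02:24.772935   37384 token.go:199] [discovery] Trying to connect to API Server "192.168.x.x:6443"',
    'I0721 17:02:24.776478   37384 token.go:202] [discovery] Failed to connect to API Server "192.168.x.x:6443": token id "xxxx" is invalid for this cluster or it has expired.',
    'I0721 17:02:29.776478   37384 token.go:202] [discovery] Failed to connect to API Server "192.168.x.x:6443": token id "xxxx" is invalid for this cluster or it has expired.',
]

for message in discovery_failures(sample_lines):
    print(message)
```

Here the two failed attempts collapse into one line, which points straight at the token being rejected as invalid or expired.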

On the compute node, check the kubelet service with systemctl status kubelet:

Jul 21 17:11:12 localhost kubelet: E0721 17:11:12.285975   44259 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:445: Failed to list *v1.Service: Unauthorized
Jul 21 17:11:12 localhost kubelet: E0721 17:11:12.289409   44259 kubelet.go:2252] node "sz-node-8-6" not found
Jul 21 17:11:12 localhost kubelet: E0721 17:11:12.389659   44259 kubelet.go:2252] node "sz-node-8-6" not found
Jul 21 17:11:12 localhost kubelet: E0721 17:11:12.460715   44259 file_linux.go:61] Unable to read config path "/etc/kubernetes/manifests": path does not exist, ignoring
Jul 21 17:11:12 localhost kubelet: E0721 17:11:12.486273   44259 reflector.go:125] k8s.io/kubernetes/pkg/kubelet/kubelet.go:454: Failed to list *v1.Node: Unauthorized
Jul 21 17:11:12 localhost kubelet: E0721 17:11:12.489897   44259 kubelet.go:2252] node "sz-node-8-6" not found
Jul 21 17:11:12 localhost kubelet: E0721 17:11:12.590123   44259 kubelet.go:2252] node "sz-node-8-6" not found

The node cannot register. Let's see where the kubelet configuration lives:

[Unit]
Description=kubelet: The Kubernetes Node Agent
Documentation=https://kubernetes.io/docs/

[Service]
ExecStart=/usr/bin/kubelet
Restart=always
StartLimitInterval=0
RestartSec=10

[Install]
WantedBy=multi-user.target

# /usr/lib/systemd/system/kubelet.service.d/10-kubeadm.conf
# Note: This dropin only works with kubeadm and kubelet v1.11+
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"
# This is a file that "kubeadm init" and "kubeadm join" generates at runtime, populating the KUBELET_KUBEADM_ARGS variable dynamically
EnvironmentFile=-/var/lib/kubelet/kubeadm-flags.env
# This is a file that the user can use for overrides of the kubelet args as a last resort. Preferably, the user should use
# the .NodeRegistration.KubeletExtraArgs object in the configuration files instead. KUBELET_EXTRA_ARGS should be sourced from this file.
EnvironmentFile=-/etc/sysconfig/kubelet
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS

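I compared several config files and found no real difference from the other compute nodes; all of them were deployed on freshly reinstalled machines.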

/etc/kubernetes/kubelet.conf

...
preferences: {}
users:
- name: default-auth
  user:
    client-certificate: /var/lib/kubelet/pki/kubelet-client-current.pem
    client-key: /var/lib/kubelet/pki/kubelet-client-current.pem

Check the certificate:

openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem  -noout -text

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            6f:58:0f:41:fb:....
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN=kubernetes
        Validity
            Not Before: Jul 21 10:47:00 2023 GMT
            Not After : Jul 20 10:47:00 2024 GMT
        Subject: O=system:nodes, CN=system:node:sz

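🤔 Why is Not Before 50 minutes in the future (GMT+8)? A client certificate that is not yet valid fails verification, which would presumably explain the Unauthorized errors from kubelet. I checked all the masters and found one whose clock was wrong. After correcting the time on that master and re-running the join, it finally succeeded.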
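The comparison I did by eye can be sketched in a few lines: parse the Not Before value printed by openssl and subtract the current time. This is only an illustration. The helper names are made up, and the "now" value in the example is a hypothetical observation time chosen to reproduce the 50-minute gap, not a timestamp taken from the logs.

```python
from datetime import datetime, timezone


def parse_openssl_date(value):
    """Parse an OpenSSL date such as 'Jul 21 10:47:00 2023 GMT' as aware UTC."""
    parsed = datetime.strptime(value.strip(), "%b %d %H:%M:%S %Y GMT")
    return parsed.replace(tzinfo=timezone.utc)


def seconds_until_valid(not_before, now):
    """Positive result: the certificate is not yet valid at `now`."""
    return (parse_openssl_date(not_before) - now).total_seconds()


# Not Before from the certificate above; `now` is a hypothetical local reading
# of 17:57 GMT+8, i.e. 09:57 UTC.
now = datetime(2023, 7, 21, 9, 57, 0, tzinfo=timezone.utc)
gap = seconds_until_valid("Jul 21 10:47:00 2023 GMT", now)
print(f"certificate becomes valid in {gap / 60:.0f} minutes")
```

In real use, `now` would be datetime.now(timezone.utc) on the node, and any clearly positive result for a freshly issued certificate suggests that the signer's clock is ahead.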

Strangely, the ntpd process had died. Maybe Cloudpods should add a clock check to its host monitoring items?
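As a rough idea of what such a check could look like (this is not an existing Cloudpods feature), the sketch below flags hosts whose clock differs from the median of all hosts by more than a tolerance. The host names, the readings and the clock_outliers helper are all hypothetical. How the epoch readings are collected is left open, and the tolerance has to cover the delay between sampling the hosts.

```python
from statistics import median


def clock_outliers(epoch_by_host, tolerance_seconds=5.0):
    """Return {host: offset_seconds} for hosts too far from the median clock."""
    reference = median(epoch_by_host.values())
    return {
        host: reading - reference
        for host, reading in epoch_by_host.items()
        if abs(reading - reference) > tolerance_seconds
    }


# Hypothetical readings (seconds since the epoch) sampled at about the same moment.
readings = {
    "master-1": 1689930000.0,
    "master-2": 1689930001.0,
    "master-3": 1689933000.0,  # roughly 50 minutes ahead
}

print(clock_outliers(readings))
```

Using the median rather than the mean keeps a single badly skewed host from shifting the reference that everyone else is compared against.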

fangpsh's blog