
Building a Kubernetes Cluster and DevOps Practice (Part 2): Deployment in Practice

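訾懵 · yesterday 12:35
This article walks through building a production-grade Kubernetes cluster, configuring CI/CD pipelines, deploying monitoring, and troubleshooting failures, with executable commands and configuration files throughout.
Intended readers: operations engineers, DevOps engineers, and anyone who wants to build a K8s cluster hands-on.
Prerequisite reading: the companion article 《架构设计篇》 (Architecture Design), which covers the overall architecture and technology choices.
Contents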


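  • 1. Environment Preparation
  • 2. Middleware Deployment
  • 3. Kubernetes Cluster Setup
  • 4. CI/CD Pipeline Setup
  • 5. Application Deployment in Practice
  • 6. Monitoring and Logging
  • 7. Troubleshooting and Problem Resolution
  • 8. Summary and Checklists
1. Environment Preparation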

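1.0 Network Planning

Network planning is the first deployment step: a well-structured network isolates the tiers from one another and keeps traffic between them efficient.
1.0.1 VPC and Subnet Planning

A three-tier subnet layout separates the public entry point, the middleware, and the application services: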
    VPC network: 10.0.0.0/16
    ├── entry subnet:      10.0.10.0/24  (public entry tier)
    ├── middleware subnet: 10.0.20.0/24  (middleware tier)
    └── k8s subnet:        10.0.30.0/24  (application service tier)
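Subnet             CIDR          Purpose                              Servers
-----------------  ------------  -----------------------------------  ------------------------
entry subnet       10.0.10.0/24  Public entry, operations management  entry-01, jumpserver
middleware subnet  10.0.20.0/24  Middleware services                  middleware-01
k8s subnet         10.0.30.0/24  K8s cluster                          master-01~03, node-01~02

K8s internal network plan

Range          CIDR           Purpose
-------------  -------------  -------------------------------------
Pod range      172.16.0.0/16  Pod IP allocation (managed by Calico)
Service range  10.96.0.0/12   Service ClusterIP

1.0.2 Security Group Configuration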

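Entry subnet security group (public entry)

Direction  Ports    Source/Destination       Purpose
---------  -------  -----------------------  ----------------------------------
Inbound    80, 443  0.0.0.0/0                Web access
Inbound    10022    Operations IP allowlist  SSH management (non-standard port)
Outbound   ALL      0.0.0.0/0                Allow all outbound traffic

K8s subnet security group

Direction  Ports         Source/Destination     Purpose
---------  ------------  ---------------------  --------------------------
Inbound    6443          entry subnet           K8s API Server
Inbound    30080, 30443  entry subnet           Ingress NodePort
Inbound    ALL           within the k8s subnet  Intra-cluster traffic
Inbound    ALL           172.16.0.0/16          Pod network traffic
Outbound   ALL           0.0.0.0/0              Allow all outbound traffic

Middleware subnet security group

Direction  Ports                   Source/Destination  Purpose
---------  ----------------------  ------------------  --------------------------
Inbound    3306, 6379, 8848, etc.  k8s subnet          Middleware service ports
Inbound    10022                   entry subnet        SSH management
Outbound   ALL                     0.0.0.0/0           Allow all outbound traffic

1.0.3 Server Access Rules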

    graph LR
        subgraph entry_subnet["entry subnet"]
            Entry[Entry node]
            Jump[JumpServer]
        end
        subgraph middleware_subnet["middleware subnet"]
            MW[Middleware]
        end
        subgraph k8s_subnet["k8s subnet"]
            Master[K8s Master]
            Worker[K8s Worker]
        end
        Internet((Internet)) --> |80/443| Entry
        Entry --> |6443| Master
        Entry --> |30080| Worker
        Jump --> |10022| Master
        Jump --> |10022| MW
        Worker --> |3306/6379/8848| MW
        Master --> |3306/6379/8848| MW

1.1 Server Inventory

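Role        Hostname       Example IP  Spec    Notes
----------  -------------  ----------  ------  -------------------
Entry       entry-01       10.0.10.10  2C/4G   Nginx + Squid proxy
Middleware  middleware-01  10.0.20.10  8C/32G  MySQL, Redis, etc.
K8s Master  k8s-master-01  10.0.30.10  4C/8G   Control plane
K8s Master  k8s-master-02  10.0.30.11  4C/8G   Control plane
K8s Master  k8s-master-03  10.0.30.12  4C/8G   Control plane
K8s Worker  k8s-node-01    10.0.30.20  8C/32G  Worker node
K8s Worker  k8s-node-02    10.0.30.21  8C/32G  Worker node
JumpServer  jumpserver     10.0.10.20  4C/8G   Bastion host

1.2 Infrastructure Initialization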

Before deploying K8s, complete the base infrastructure initialization.
1.2.1 Base Server Configuration

Run on all servers:
    #!/bin/bash
    # Base server configuration script

    # 1. Set the hostname (adjust per server role)
    NODE_NAME="k8s-master-01"
    hostnamectl set-hostname "$NODE_NAME"
    echo "127.0.0.1 $NODE_NAME" >> /etc/hosts

    # 2. Time zone and NTP
    timedatectl set-timezone Asia/Shanghai
    timedatectl set-ntp yes

    # 3. Kernel parameter tuning
    cat > /etc/sysctl.d/local.conf << EOF
    # File descriptors
    fs.file-max = 512000
    # TCP tuning
    net.core.rmem_max = 67108864
    net.core.wmem_max = 67108864
    net.core.somaxconn = 4096
    net.ipv4.tcp_syncookies = 1
    net.ipv4.tcp_tw_reuse = 1
    net.ipv4.tcp_fin_timeout = 30
    net.ipv4.tcp_keepalive_time = 1200
    net.ipv4.ip_local_port_range = 10000 65000
    net.ipv4.tcp_max_syn_backlog = 4096
    # Enable BBR congestion control
    net.ipv4.tcp_congestion_control = bbr
    # Disable IPv6
    net.ipv6.conf.all.disable_ipv6 = 1
    net.ipv6.conf.default.disable_ipv6 = 1
    EOF
    sysctl -p /etc/sysctl.d/local.conf

    # 4. System resource limits (this overwrites the existing limits.conf)
    cat > /etc/security/limits.conf << EOF
    *         hard    nofile      512000
    *         soft    nofile      512000
    root      hard    nofile      512000
    root      soft    nofile      512000
    EOF

    # 5. SSH hardening (this overwrites sshd_config; the port matches the
    #    10022 rule in the security groups, so keep the current session open
    #    until a new login on that port has been verified)
    cat > /etc/ssh/sshd_config << EOF
    Include /etc/ssh/sshd_config.d/*.conf
    Port 10022
    PermitRootLogin prohibit-password
    PubkeyAuthentication yes
    PasswordAuthentication no
    ClientAliveInterval 60
    ClientAliveCountMax 5
    EOF
    systemctl restart sshd
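The sysctl file above sets `net.ipv4.tcp_congestion_control = bbr`. The kernel only accepts that value when BBR appears in `net.ipv4.tcp_available_congestion_control`, which on many distributions requires loading the `tcp_bbr` module first. The following is a minimal sketch of a pre-check; the helper name `bbr_available` is hypothetical, and the optional file argument exists only so the function can be exercised against a sample file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when "bbr" is among the congestion-control
# algorithms listed in the given file (defaults to the live kernel value).
bbr_available() {
    local list_file="${1:-/proc/sys/net/ipv4/tcp_available_congestion_control}"
    [ -r "$list_file" ] || return 2
    # The file holds one space-separated line, for example "reno cubic bbr".
    grep -qw 'bbr' "$list_file"
}

if bbr_available; then
    echo "bbr is available"
else
    echo "bbr is not listed; try 'modprobe tcp_bbr' before applying the sysctl settings"
fi
```

Running this before `sysctl -p` avoids a partially applied configuration on kernels where BBR is not yet loaded.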
2.2 Docker Compose Configuration
    #!/bin/bash
    # Entry server Nginx configuration

    # 1. Install Nginx
    apt-get update
    apt-get install -y nginx

    # 2. Configure Nginx (the stream module provides TCP load balancing)
    cat > /etc/nginx/nginx.conf << 'EOF'
    user www-data;
    worker_processes auto;
    pid /run/nginx.pid;
    # Debian/Ubuntu packages load dynamic modules (including stream) from here
    include /etc/nginx/modules-enabled/*.conf;

    events {
        worker_connections 20480;
        multi_accept on;
    }

    # TCP load balancing (used for the K8s API Server)
    stream {
        include /data/nginx/stream-sites-enabled/*;
    }

    http {
        sendfile on;
        tcp_nopush on;
        tcp_nodelay on;
        keepalive_timeout 65;
        client_max_body_size 0;

        include /etc/nginx/mime.types;
        default_type application/octet-stream;

        log_format main '[$time_local] $remote_addr -> '
                        '"$request" $status $body_bytes_sent '
                        '"$http_user_agent" $request_time';

        access_log /data/nginx/logs/access.log main;
        error_log /data/nginx/logs/error.log;

        gzip on;
        gzip_types text/plain text/css application/json application/javascript;

        # Shared definitions such as upstreams live in conf.d
        include /data/nginx/conf.d/*.conf;
        include /data/nginx/sites-enabled/*;
    }
    EOF

    # 3. Create the directory layout
    mkdir -p /data/nginx/{stream-sites-enabled,logs,sites-enabled,conf.d}
    chown -R www-data:www-data /data/nginx
2.3 MySQL Tuning Configuration
    # TCP load balancing for the K8s API Server (port 6443)
    cat > /data/nginx/stream-sites-enabled/k8s-apiserver.conf << EOF
    upstream k8s-apiserver {
        server 10.0.30.10:6443 max_fails=3 fail_timeout=30s;
        server 10.0.30.11:6443 max_fails=3 fail_timeout=30s;
        server 10.0.30.12:6443 max_fails=3 fail_timeout=30s;
    }
    server {
        listen 6443;
        proxy_pass k8s-apiserver;
        # Idle timeout between reads or writes; keep it long enough for
        # long-lived requests such as watches and "kubectl logs -f"
        proxy_timeout 10m;
        proxy_connect_timeout 1s;
    }
    EOF
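With three masters behind one stream upstream, it can help to confirm that each backend accepts TCP connections before pointing clients at the load balancer. The sketch below extracts the `server host:port` entries from an upstream file and probes them through bash's `/dev/tcp` redirection. The function names are hypothetical, and the probe assumes bash plus the coreutils `timeout` command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every "host:port" that follows a "server"
# directive in an Nginx upstream file ("server {" blocks are skipped).
list_backends() {
    awk '$1 == "server" && $2 ~ /:[0-9]+;?$/ { sub(/;$/, "", $2); print $2 }' "$1"
}

# Hypothetical helper: succeeds when a TCP connection to host:port opens
# within two seconds.
probe_backend() {
    local host="${1%:*}" port="${1##*:}"
    # bash opens a TCP connection when redirecting to /dev/tcp/HOST/PORT.
    timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Print UP or DOWN for each backend found in the given upstream file.
check_upstream() {
    local backend
    for backend in $(list_backends "$1"); do
        if probe_backend "$backend"; then
            echo "UP   $backend"
        else
            echo "DOWN $backend"
        fi
    done
}
```

A typical call on the entry server would be `check_upstream /data/nginx/stream-sites-enabled/k8s-apiserver.conf`. A TCP connect only shows that the port is open, not that the API server is healthy.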
2.4 Starting the Middleware
    # HTTP load balancing across the Ingress nodes
    cat > /data/nginx/conf.d/k8s-ingress.conf << 'EOF'
    upstream ingress_nodes {
        server 10.0.30.20:30080;
        server 10.0.30.21:30080;
    }
    EOF

    # Example application site configuration
    cat > /data/nginx/sites-enabled/app.conf << 'EOF'
    server {
        listen 80;
        server_name app.example.com;

        location / {
            proxy_pass http://ingress_nodes;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        }
    }
    EOF

    # Validate the configuration, then reload only if it is valid
    nginx -t && systemctl reload nginx
    #!/bin/bash
    # One-step JumpServer deployment

    # Deploy quickly with the official script
    curl -sSL https://resource.fit2cloud.com/jumpserver/jumpserver/releases/download/v3.10.17/quick_start.sh | bash

    # Adjust the configuration (optional)
    # vim /opt/jumpserver/config/config.txt
    # Common settings:
    # - HTTP_PORT=80
    # - HTTPS_PORT=443
    # - DOMAINS="jumpserver.example.com"

    # Restart the service
    cd /opt/jumpserver-installer-v3.10.17
    ./jmsctl.sh restart
    #!/bin/bash
    # Docker engine installation and configuration

    # 1. Install dependencies
    apt-get update
    apt-get install -y ca-certificates curl gnupg lsb-release

    # 2. Add the Docker GPG key (from the Aliyun mirror)
    curl -fsSL https://mirrors.aliyun.com/docker-ce/linux/ubuntu/gpg | \
        gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg

    # 3. Add the Docker repository
    echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] \
        https://mirrors.aliyun.com/docker-ce/linux/ubuntu $(lsb_release -cs) stable" | \
        tee /etc/apt/sources.list.d/docker.list

    # 4. Install Docker
    apt-get update
    apt-get install -y docker-ce docker-ce-cli containerd.io docker-compose-plugin

    # 5. Configure Docker
    #    registry-mirrors: optional registry mirror
    #    data-root:        Docker data directory
    #    (JSON does not allow comments, so they stay outside the file)
    mkdir -p /data/docker /etc/docker
    cat > /etc/docker/daemon.json << EOF
    {
      "log-driver": "json-file",
      "log-opts": {
        "max-size": "100m",
        "max-file": "3"
      },
      "registry-mirrors": [
        "https://docker.m.daocloud.io"
      ],
      "data-root": "/data/docker"
    }
    EOF

    # 6. Start Docker
    systemctl enable docker
    systemctl restart docker

    # 7. Verify the installation
    docker info
    docker compose version
5.2 Ingress Routing Configuration
    #!/bin/bash
    # Server security hardening

    # 1. Install fail2ban to block brute-force attempts
    apt-get install -y fail2ban

    # 2. Configure fail2ban
    cat > /etc/fail2ban/jail.local << EOF
    [DEFAULT]
    ignoreip = 127.0.0.1/8 ::1
    bantime = 3600
    maxretry = 3
    findtime = 600
    banaction = iptables-multiport

    [sshd]
    enabled = true
    port = 10022
    logpath = /var/log/auth.log
    maxretry = 3
    bantime = 3600
    EOF

    # 3. Start fail2ban
    systemctl enable fail2ban
    systemctl restart fail2ban

    # 4. Check the jail status
    fail2ban-client status sshd
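The `iptables-multiport` ban action blocks only the ports named in the jail, so the `port` value in the `[sshd]` jail needs to match the port sshd actually listens on. The following sketch compares the two files; the function names are hypothetical, and it reads a single `sshd_config` file without following `Include` directives.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first "Port N" from an sshd_config file.
# sshd listens on 22 when no Port directive is present.
sshd_port() {
    local p
    p=$(awk 'tolower($1) == "port" { print $2; exit }' "$1")
    echo "${p:-22}"
}

# Hypothetical helper: print the "port = N" value from the [sshd] section
# of a fail2ban jail file.
jail_port() {
    awk -F'[ =]+' '
        /^\[/ { in_sshd = ($0 == "[sshd]") }
        in_sshd && $1 == "port" { print $2; exit }
    ' "$1"
}

# Succeeds when both files name the same port.
ports_match() {
    [ "$(sshd_port "$1")" = "$(jail_port "$2")" ]
}
```

For example, `ports_match /etc/ssh/sshd_config /etc/fail2ban/jail.local || echo "port mismatch"` flags a jail that protects the wrong port. A jail that uses a service name such as `ssh` instead of a number would also be reported as a mismatch by this simple comparison.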
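5.3 Using ConfigMap and Secret
    # Disable swap immediately
    swapoff -a

    # Disable it permanently: remove swap entries from fstab
    # (a backup is kept as /etc/fstab.bak)
    sed -i.bak -E '/^[^#].*[[:space:]]swap[[:space:]]/d' /etc/fstab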
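By default the kubelet refuses to start while swap is active, so it is worth confirming that `swapoff -a` took effect. `/proc/swaps` contains a header line followed by one line per active swap area. This is a small sketch of that check; the function name `swap_devices` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the active swap areas listed in a
# /proc/swaps-style file (header line skipped). Prints nothing when swap is off.
swap_devices() {
    local swaps_file="${1:-/proc/swaps}"
    [ -r "$swaps_file" ] || return 0
    tail -n +2 "$swaps_file" | awk 'NF { print $1 }'
}

active="$(swap_devices)"
if [ -z "$active" ]; then
    echo "swap is off"
else
    echo "swap still active on: $(echo "$active" | tr '\n' ' ')"
fi
```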
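Reference them in a Deployment:
    # Load the kernel modules required by Kubernetes at boot.
    # modules-load.d accepts comments only on their own lines.
    cat > /etc/modules-load.d/k8s.conf << EOF
    # OverlayFS filesystem
    overlay
    # Bridge netfilter
    br_netfilter
    EOF
    modprobe overlay
    modprobe br_netfilter

    # Verify
    lsmod | grep -E "overlay|br_netfilter"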
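`lsmod` lists only loadable modules, so a feature compiled into the kernel does not show up there even though it works. The sketch below checks the effect instead of the module list: OverlayFS is usable when it appears in `/proc/filesystems`, and `br_netfilter` exposes its sysctls under `/proc/sys/net/bridge` once it is active. The function names are hypothetical, and the optional arguments exist only for testing against sample paths.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the kernel lists overlay as a
# registered filesystem.
overlay_ready() {
    grep -qw 'overlay' "${1:-/proc/filesystems}" 2>/dev/null
}

# Hypothetical helper: succeeds when the bridge netfilter sysctls exist,
# which is the case once br_netfilter is active.
br_netfilter_ready() {
    [ -e "${1:-/proc/sys/net/bridge}/bridge-nf-call-iptables" ]
}

if overlay_ready; then echo "overlay: ok"; else echo "overlay: missing"; fi
if br_netfilter_ready; then echo "br_netfilter: ok"; else echo "br_netfilter: missing"; fi
```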
6. Monitoring and Logging

6.1 Prometheus + Grafana Deployment

6.1.1 Deploying Node Exporter
    # Kernel parameters for Kubernetes nodes
    cat > /etc/sysctl.d/k8s.conf << EOF
    # Required by K8s
    net.ipv4.ip_forward = 1
    net.bridge.bridge-nf-call-iptables = 1
    net.bridge.bridge-nf-call-ip6tables = 1
    # Connection tracking
    net.netfilter.nf_conntrack_max = 524288
    # TCP tuning
    net.ipv4.tcp_keepalive_time = 600
    net.ipv4.tcp_keepalive_intvl = 30
    net.core.somaxconn = 32768
    # File descriptors
    fs.file-max = 2097152
    EOF
    sysctl --system
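`sysctl --system` reports errors for keys it cannot set, but those messages are easy to miss in a long run. The following sketch re-reads a sysctl.d file and compares every `key = value` line with the live value under `/proc/sys`. The function name `sysctl_drift` is hypothetical, and the second argument (an alternative root directory) exists only so the function can be tested.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one line per key whose live value differs from
# the value in the given sysctl.d file, or whose /proc/sys entry is absent.
sysctl_drift() {
    local conf="$1" proc_root="${2:-/proc/sys}"
    local key value path live
    while IFS='=' read -r key value; do
        key="${key//[[:space:]]/}"
        # Skip blank lines and comment lines.
        case "$key" in ''|'#'*|';'*) continue ;; esac
        value="$(echo $value)"            # normalize whitespace
        path="$proc_root/${key//.//}"     # net.ipv4.ip_forward -> net/ipv4/ip_forward
        if [ ! -r "$path" ]; then
            echo "MISSING $key"
            continue
        fi
        live="$(echo $(cat "$path"))"     # normalize tabs and newlines
        [ "$live" = "$value" ] || echo "DIFF $key expected=$value live=$live"
    done < "$conf"
}
```

Calling `sysctl_drift /etc/sysctl.d/k8s.conf` prints nothing when every value is in effect. A `MISSING` line for `net.netfilter.nf_conntrack_max` usually means the `nf_conntrack` module has not been loaded yet, and one for the `net.bridge.*` keys means `br_netfilter` is not active.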
6.1.2 Prometheus Configuration
    # Raise file descriptor and process limits for Kubernetes nodes
    cat >> /etc/security/limits.conf << EOF
    # Kubernetes resource limits
    * soft nofile 655360
    * hard nofile 655360
    * soft nproc 655360
    * hard nproc 655360
    EOF
6.2 Example Alerting Rules
    # Add the Docker package repository (it also ships containerd)
    mkdir -p /etc/apt/keyrings
    curl -fsSL https://mirrors.aliyun.com/docker-ce/linux/ubuntu/gpg | \
      gpg --dearmor -o /etc/apt/keyrings/docker.gpg
    echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
      https://mirrors.aliyun.com/docker-ce/linux/ubuntu $(lsb_release -cs) stable" | \
      tee /etc/apt/sources.list.d/docker.list
    apt-get update
    apt-get install -y containerd.io
6.3 Log Collection

Use Filebeat to collect container logs into Elasticsearch:
    # containerd configuration for Kubernetes nodes
    mkdir -p /etc/containerd
    cat > /etc/containerd/config.toml << 'EOF'
    version = 2
    [plugins."io.containerd.grpc.v1.cri"]
      # Pull the pause image from a mirror inside China
      sandbox_image = "registry.aliyuncs.com/google_containers/pause:3.9"

      [plugins."io.containerd.grpc.v1.cri".containerd]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            SystemdCgroup = true  # use systemd as the cgroup driver

      [plugins."io.containerd.grpc.v1.cri".registry.mirrors]
        [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
          endpoint = ["https://docker.m.daocloud.io"]
        [plugins."io.containerd.grpc.v1.cri".registry.mirrors."registry.k8s.io"]
          endpoint = ["https://k8s.m.daocloud.io"]
    EOF
    systemctl daemon-reload
    systemctl restart containerd
    systemctl enable containerd
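Since v1.22, kubeadm configures the kubelet with the systemd cgroup driver unless told otherwise, so containerd's runc runtime needs `SystemdCgroup = true` to match. A mismatch between the two drivers tends to surface as unstable Pods rather than a clear error. This sketch checks the setting with a plain text match; the function name is hypothetical, and it does not parse TOML sections.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the file contains an uncommented
# "SystemdCgroup = true" line (a trailing comment is allowed).
systemd_cgroup_enabled() {
    local cfg="${1:-/etc/containerd/config.toml}"
    grep -Eq '^[[:space:]]*SystemdCgroup[[:space:]]*=[[:space:]]*true([[:space:]]|#|$)' "$cfg" 2>/dev/null
}

if systemd_cgroup_enabled; then
    echo "containerd uses the systemd cgroup driver"
else
    echo "SystemdCgroup = true not found in the containerd configuration"
fi
```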
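7. Troubleshooting and Problem Resolution

7.1 Common Issues Quick Reference

Symptom                  Likely cause                Diagnostic command                               Resolution
-----------------------  --------------------------  -----------------------------------------------  ------------------------------------------------
Pod stuck in Pending     Insufficient resources      kubectl describe pod <pod>                       Add nodes or lower resource requests
Pod in CrashLoopBackOff  Application fails to start  kubectl logs <pod>                               Check application configuration and dependencies
ImagePullBackOff         Image pull failed           kubectl describe pod <pod>                       Check the image address and credentials
Service unreachable      Endpoints are empty         kubectl get endpoints <service>                  Check Pod labels and the Service selector
Ingress returns 502      Backend Pods not ready      kubectl get pods                                 Check the readinessProbe
OOMKilled                Out of memory               kubectl describe pod <pod>                       Raise the memory limit
Node NotReady            Network or kubelet problem  kubectl describe node <node>                     Check kubelet and the network plugin
DNS resolution fails     CoreDNS problem             kubectl logs -n kube-system -l k8s-app=kube-dns  Restart CoreDNS

7.2 Troubleshooting Command Quick Reference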
    # Add the Aliyun Kubernetes apt repository
    mkdir -p /etc/apt/keyrings
    curl -fsSL https://mirrors.aliyun.com/kubernetes/apt/doc/apt-key.gpg | \
      gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
    echo "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://mirrors.aliyun.com/kubernetes/apt/ kubernetes-xenial main" | \
      tee /etc/apt/sources.list.d/kubernetes.list

    # Install a pinned version.
    # If apt cannot find this version string, list what the repository
    # actually offers with: apt-cache madison kubeadm
    apt-get update
    apt-get install -y kubelet=1.28.6-1.1 kubeadm=1.28.6-1.1 kubectl=1.28.6-1.1

    # Hold the packages to prevent accidental upgrades
    apt-mark hold kubelet kubeadm kubectl

    # Enable kubelet
    systemctl enable kubelet
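The quick-reference table in section 7.1 maps a symptom to the first command worth running. As a rough sketch, the same mapping can be kept as a shell lookup for on-call use. The function name `triage_hint` is hypothetical, and `<pod>` and `<node>` are placeholders to fill in.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first diagnostic command for a Pod or node
# status, following the quick-reference table in section 7.1.
triage_hint() {
    case "$1" in
        Pending|ImagePullBackOff|OOMKilled)
            echo 'kubectl describe pod <pod>' ;;
        CrashLoopBackOff)
            echo 'kubectl logs <pod>' ;;
        NotReady)
            echo 'kubectl describe node <node>' ;;
        *)
            echo "no entry for status: $1" >&2
            return 1 ;;
    esac
}

triage_hint CrashLoopBackOff
```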
7.3 Typical Case Studies

Case 1: Pod frequently OOMKilled

Symptom: the Pod restarts every few hours and its status shows OOMKilled.
Investigation:
    # Time synchronization with chrony
    apt-get install -y chrony
    cat > /etc/chrony/chrony.conf << 'EOF'
    server ntp.aliyun.com iburst
    server ntp.tencent.com iburst
    driftfile /var/lib/chrony/chrony.drift
    makestep 1.0 3
    rtcsync
    EOF
    systemctl restart chrony
    systemctl enable chrony
Cause: the JVM heap size does not match the K8s memory limit.
Resolution:
    # Host name resolution for the cluster nodes and the API load balancer
    cat >> /etc/hosts << EOF
    10.0.30.10  k8s-master-01
    10.0.30.11  k8s-master-02
    10.0.30.12  k8s-master-03
    10.0.30.20  k8s-node-01
    10.0.30.21  k8s-node-02
    10.0.10.10  k8s-api-lb
    EOF
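Case 1 traces the OOM kills to a JVM heap that does not fit the container's memory limit. A JVM needs memory beyond the heap (metaspace, thread stacks, direct buffers), so the heap has to stay below the limit. The sketch below derives an `-Xmx` value from the limit; the 75% ratio is an assumed rule of thumb rather than a figure from this article, and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: heap size in MiB for a given container memory limit
# in MiB. The percentage defaults to 75, which is an assumption.
heap_mb_for_limit() {
    local limit_mb="$1" percent="${2:-75}"
    echo $(( limit_mb * percent / 100 ))
}

limit_mb=2048
echo "-Xmx$(heap_mb_for_limit "$limit_mb")m"   # prints -Xmx1536m
```

On container-aware JVMs (JDK 10 and later, or 8u191 and later), `-XX:MaxRAMPercentage=75.0` expresses the same ratio without hard-coding a size, so the heap follows the limit when the limit changes.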
Case 2: 504 on cross-node Pod communication

Symptom: Pods on the same node communicate normally, but cross-node requests time out with 504; roughly 50% of requests fail.
Quick investigation:
    # Hot data disk (SSD): MySQL, Redis
    mkdir -p /data/hot
    mount /dev/vdb1 /data/hot

    # Cold data disk (HDD): Elasticsearch, MinIO
    mkdir -p /data/cold
    mount /dev/vdc1 /data/cold

    # Add fstab entries so the disks mount at boot
    echo '/dev/vdb1 /data/hot ext4 defaults 0 0' >> /etc/fstab
    echo '/dev/vdc1 /data/cold ext4 defaults 0 0' >> /etc/fstab
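Root cause: the cloud security group allowed only the node network and did not allow the Pod network CIDR.
Resolution: add the following rules to the security group:

  • Inbound: ANY - Pod network CIDR (for example 172.16.0.0/16)
  • Outbound: ANY - Pod network CIDR (for example 172.16.0.0/16)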
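The root cause in Case 2 comes down to CIDR membership: cross-node Pod traffic carries a source address from the Pod range, which a rule for the node subnet alone does not cover. The following pure-bash sketch shows the check a security group effectively performs. The function names are hypothetical, the sample addresses are illustrative, and the CIDRs are the ones used in this article.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert a dotted IPv4 address to an integer.
ip_to_int() {
    local a b c d
    IFS=. read -r a b c d <<< "$1"
    echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Hypothetical helper: succeeds when the address lies inside the CIDR block.
ip_in_cidr() {
    local ip="$1" net="${2%/*}" bits="${2#*/}"
    local mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
    [ $(( $(ip_to_int "$ip") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}

# A node address and a Pod address checked against the node-subnet rule.
for src in 10.0.30.20 172.16.5.7; do
    if ip_in_cidr "$src" 10.0.30.0/24; then
        echo "$src is covered by the 10.0.30.0/24 rule"
    else
        echo "$src is NOT covered by the 10.0.30.0/24 rule"
    fi
done
```

The Pod address falls outside 10.0.30.0/24, which is why a separate rule for 172.16.0.0/16 is needed.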
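For detailed case analyses, see the companion article 《故障排查实战》 (Troubleshooting in Practice).
8. Summary and Checklists

8.1 Pre-deployment Checklist

Check                     Command / action              Expected result
------------------------  ----------------------------  ------------------------------
System time synchronized  timedatectl                   System clock synchronized: yes
Swap disabled             free -h                       Swap row shows all zeros
Kernel modules loaded     lsmod | grep br_netfilter     Output is non-empty
containerd running        systemctl status containerd   active (running)
kubelet enabled           systemctl is-enabled kubelet  enabled
Network connectivity      Ping test between nodes       All nodes reachable
Image registry reachable  crictl pull nginx             Pull succeeds

8.2 Post-deployment Verification Checklist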

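Check                     Command                                                           Expected result
------------------------  ----------------------------------------------------------------  ------------------------------
All nodes Ready           kubectl get nodes                                                 STATUS is Ready for every node
System Pods healthy       kubectl get pods -n kube-system                                   All Running
Network plugin healthy    kubectl get pods -n calico-system                                 All Running
DNS resolution works      kubectl run test --rm -it --image=busybox -- nslookup kubernetes  Name resolves
Cross-node communication  Create two Pods and ping each other                               Traffic flows normally
Ingress works             Create a test Ingress and access it                               Normal response

8.3 Common Commands Quick Reference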
    # docker-compose.yml
    version: '3'
    services:
      mysql:
        image: mysql:8.0
        restart: always
        ports:
          - 3306:3306
        volumes:
          - /data/hot/mysql:/var/lib/mysql
          - ./config/my.cnf:/etc/mysql/conf.d/my.cnf
        environment:
          MYSQL_ROOT_PASSWORD: ${MYSQL_PASSWORD}
          TZ: Asia/Shanghai
        networks:
          - middleware
      redis:
        image: redis:7.2
        restart: always
        ports:
          - 6379:6379
        volumes:
          - /data/hot/redis:/data
        command: redis-server --requirepass ${REDIS_PASSWORD} --appendonly yes
        networks:
          - middleware
      nacos:
        image: nacos/nacos-server:v2.3.2
        restart: always
        depends_on:
          - mysql
        environment:
          MODE: standalone
          NACOS_AUTH_ENABLE: "true"
          SPRING_DATASOURCE_PLATFORM: mysql
          MYSQL_SERVICE_HOST: mysql
          MYSQL_SERVICE_DB_NAME: nacos
          MYSQL_SERVICE_USER: root
          MYSQL_SERVICE_PASSWORD: ${MYSQL_PASSWORD}
        ports:
          - 8848:8848
          - 9848:9848
        networks:
          - middleware
      rabbitmq:
        image: rabbitmq:3.12-management
        restart: always
        ports:
          - 5672:5672
          - 15672:15672
        environment:
          RABBITMQ_DEFAULT_USER: admin
          RABBITMQ_DEFAULT_PASS: ${RABBITMQ_PASSWORD}
        volumes:
          - /data/hot/rabbitmq:/var/lib/rabbitmq
        networks:
          - middleware
      elasticsearch:
        image: elasticsearch:7.17.19
        restart: always
        volumes:
          - /data/cold/elasticsearch:/usr/share/elasticsearch/data
        environment:
          discovery.type: single-node
          ES_JAVA_OPTS: -Xms2g -Xmx2g
        ports:
          - 9200:9200
        networks:
          - middleware
    networks:
      middleware:
        driver: bridge
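The Compose file above references `${MYSQL_PASSWORD}`, `${REDIS_PASSWORD}`, and `${RABBITMQ_PASSWORD}`. Docker Compose substitutes these from the shell environment or from a `.env` file in the project directory. The sketch below writes such a file with random values; the function names and the choice of 24-character alphanumeric passwords are assumptions, not part of the original setup.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a random alphanumeric password of up to
# 24 characters.
gen_password() {
    head -c 24 /dev/urandom | base64 | tr -dc 'A-Za-z0-9' | head -c 24
}

# Hypothetical helper: write the three variables used by the Compose file
# to the given path, readable only by the owner. Overwrites an existing file.
write_env_file() {
    local target="$1" var
    (
        umask 077
        : > "$target"
        for var in MYSQL_PASSWORD REDIS_PASSWORD RABBITMQ_PASSWORD; do
            printf '%s=%s\n' "$var" "$(gen_password)" >> "$target"
        done
    )
}
```

Running `write_env_file ./.env` next to `docker-compose.yml` and then `docker compose up -d` starts the stack with those values. Since the function overwrites the target, it should run only once, before the databases initialize their data directories.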
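8.4 Key Configuration File Locations

Item                      Path / command
------------------------  ---------------------------------------
kubeadm admin kubeconfig  /etc/kubernetes/admin.conf
kubelet configuration     /var/lib/kubelet/config.yaml
containerd configuration  /etc/containerd/config.toml
Calico configuration      kubectl get installation default -o yaml
kubectl configuration     ~/.kube/config

Keywords: Kubernetes, deployment practice, CI/CD, monitoring, troubleshooting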
