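This article walks through building a production-grade Kubernetes cluster: the setup steps, CI/CD pipeline configuration, monitoring deployment, and troubleshooting methods, with runnable commands and configuration files.
Audience: operations engineers, DevOps engineers, and anyone who wants hands-on experience building a K8s cluster
Prerequisite reading: the "Architecture Design" installment (《架构设计篇》), which covers the overall architecture and technology choices
Contents
- 1. Environment Preparation
- 2. Middleware Deployment
- 3. Kubernetes Cluster Setup
- 4. CI/CD Pipeline Setup
- 5. Application Deployment in Practice
- 6. Monitoring and Logging
- 7. Troubleshooting and Problem Resolution
- 8. Summary and Checklists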
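1. Environment Preparation
1.0 Network planning
Network planning is the first deployment step; a sound network layout provides security isolation and efficient communication.
1.0.1 VPC and subnet planning
A three-tier subnet layout provides network isolation:
- VPC network: 10.0.0.0/16
- ├── entry subnet: 10.0.10.0/24 (public entry tier)
- ├── middleware subnet: 10.0.20.0/24 (middleware tier)
- └── k8s subnet: 10.0.30.0/24 (application tier)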
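A plan like this is easy to sanity-check before any server is provisioned. The sketch below is a minimal POSIX shell helper that tests whether an address falls inside a CIDR block. The function names `ip_to_int` and `ip_in_cidr` are illustrative, not part of the original article, and the sample addresses are the example IPs from the server inventory in section 1.1.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  (
    IFS=.
    set -- $1   # split on dots inside a subshell so IFS does not leak
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
  )
}

# Return 0 if address $1 lies inside CIDR block $2, non-zero otherwise.
ip_in_cidr() {
  _net=${2%/*}
  _bits=${2#*/}
  _mask=$(( (4294967295 << (32 - $_bits)) & 4294967295 ))
  [ $(( $(ip_to_int "$1") & $_mask )) -eq $(( $(ip_to_int "$_net") & $_mask )) ]
}

# Check a few planned hosts against their subnets.
for pair in "10.0.10.10 10.0.10.0/24" "10.0.20.10 10.0.20.0/24" "10.0.30.10 10.0.30.0/24"; do
  set -- $pair
  if ip_in_cidr "$1" "$2"; then
    echo "$1 is inside $2"
  else
    echo "$1 is OUTSIDE $2"
  fi
done
```

The same helper can confirm that node addresses do not fall inside the Pod or Service ranges, for example `ip_in_cidr 10.0.30.10 10.96.0.0/12` should fail.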
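| Subnet | CIDR | Purpose | Servers |
| --- | --- | --- | --- |
| entry subnet | 10.0.10.0/24 | Public entry, operations management | entry-01, jumpserver |
| middleware subnet | 10.0.20.0/24 | Middleware services | middleware-01 |
| k8s subnet | 10.0.30.0/24 | K8s cluster | master-01~03, node-01~02 |
K8s internal network plan:
| Network | CIDR | Purpose |
| --- | --- | --- |
| Pod network | 172.16.0.0/16 | Pod IP allocation (managed by Calico) |
| Service network | 10.96.0.0/12 | Service ClusterIP |
1.0.2 Security group configuration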
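Entry subnet security group (public entry):
| Direction | Port | Source/Destination | Purpose |
| --- | --- | --- | --- |
| Inbound | 80, 443 | 0.0.0.0/0 | Web access |
| Inbound | 10022 | Operations IP allowlist | SSH management (non-standard port) |
| Outbound | ALL | 0.0.0.0/0 | Allow all outbound |
K8s subnet security group:
| Direction | Port | Source/Destination | Purpose |
| --- | --- | --- | --- |
| Inbound | 6443 | entry subnet | K8s API Server |
| Inbound | 30080, 30443 | entry subnet | Ingress NodePort |
| Inbound | ALL | within the k8s subnet | Intra-cluster traffic |
| Inbound | ALL | 172.16.0.0/16 | Pod network traffic |
| Outbound | ALL | 0.0.0.0/0 | Allow all outbound |
Middleware subnet security group:
| Direction | Port | Source/Destination | Purpose |
| --- | --- | --- | --- |
| Inbound | 3306, 6379, 8848, etc. | k8s subnet | Middleware service ports |
| Inbound | 10022 | entry subnet | SSH management |
| Outbound | ALL | 0.0.0.0/0 | Allow all outbound |
1.0.3 Server access rules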
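graph LR
  subgraph entry subnet
    Entry[Entry node]
    Jump[JumpServer]
  end
  subgraph middleware subnet
    MW[Middleware]
  end
  subgraph k8s subnet
    Master[K8s Master]
    Worker[K8s Worker]
  end
  Internet((Internet)) --> |80/443| Entry
  Entry --> |6443| Master
  Entry --> |30080| Worker
  Jump --> |10022| Master
  Jump --> |10022| MW
  Worker --> |3306/6379/8848| MW
  Master --> |3306/6379/8848| MW
1.1 Server inventory
| Role | Hostname | Example IP | Spec | Notes |
| --- | --- | --- | --- | --- |
| Entry | entry-01 | 10.0.10.10 | 2C/4G | Nginx + Squid proxy |
| Middleware | middleware-01 | 10.0.20.10 | 8C/32G | MySQL, Redis, etc. |
| K8s Master | k8s-master-01 | 10.0.30.10 | 4C/8G | Control plane |
| K8s Master | k8s-master-02 | 10.0.30.11 | 4C/8G | Control plane |
| K8s Master | k8s-master-03 | 10.0.30.12 | 4C/8G | Control plane |
| K8s Worker | k8s-node-01 | 10.0.30.20 | 8C/32G | Worker node |
| K8s Worker | k8s-node-02 | 10.0.30.21 | 8C/32G | Worker node |
| JumpServer | jumpserver | 10.0.10.20 | 4C/8G | Bastion host |
1.2 Infrastructure initialization
The base infrastructure must be initialized before K8s is deployed.
1.2.1 Base server configuration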
Run on all servers:
- #!/bin/bash
- # Base server configuration script
- # 1. Set the hostname (change it to match the server's role)
- HOSTNAME="k8s-master-01"
- hostnamectl set-hostname "$HOSTNAME"
- echo "127.0.0.1 $HOSTNAME" >> /etc/hosts
- # 2. Time zone
- timedatectl set-timezone Asia/Shanghai
- timedatectl set-ntp yes
- # 3. Kernel parameter tuning
- cat > /etc/sysctl.d/local.conf << EOF
- # File descriptors
- fs.file-max = 512000
- # TCP tuning
- net.core.rmem_max = 67108864
- net.core.wmem_max = 67108864
- net.core.somaxconn = 4096
- net.ipv4.tcp_syncookies = 1
- net.ipv4.tcp_tw_reuse = 1
- net.ipv4.tcp_fin_timeout = 30
- net.ipv4.tcp_keepalive_time = 1200
- net.ipv4.ip_local_port_range = 10000 65000
- net.ipv4.tcp_max_syn_backlog = 4096
- # Enable BBR congestion control
- net.ipv4.tcp_congestion_control = bbr
- # Disable IPv6
- net.ipv6.conf.all.disable_ipv6 = 1
- net.ipv6.conf.default.disable_ipv6 = 1
- EOF
- sysctl -p /etc/sysctl.d/local.conf
- # 4. System resource limits
- cat > /etc/security/limits.conf << EOF
- * hard nofile 512000
- * soft nofile 512000
- root hard nofile 512000
- root soft nofile 512000
- EOF
- # 5. SSH hardening (port 10022 matches the security group rules and the fail2ban jail)
- cat > /etc/ssh/sshd_config << EOF
- Include /etc/ssh/sshd_config.d/*.conf
- Port 10022
- PermitRootLogin prohibit-password
- PubkeyAuthentication yes
- PasswordAuthentication no
- ClientAliveInterval 60
- ClientAliveCountMax 5
- EOF
- systemctl restart sshd
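2.2 Docker Compose configuration
- #!/bin/bash
- # Nginx configuration for the entry server
- # 1. Install Nginx
- apt-get update
- apt-get install -y nginx
- # 2. Configure Nginx (the stream module provides TCP load balancing)
- cat > /etc/nginx/nginx.conf << EOF
- user www-data;
- worker_processes auto;
- pid /run/nginx.pid;
- # On Debian/Ubuntu the stream module is packaged as a dynamic module and is loaded from here
- include /etc/nginx/modules-enabled/*.conf;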
- events {
- worker_connections 20480;
- multi_accept on;
- }
- # TCP load balancing (for the K8s API Server)
- stream {
- include /data/nginx/stream-sites-enabled/*;
- }
- http {
- sendfile on;
- tcp_nopush on;
- tcp_nodelay on;
- keepalive_timeout 65;
- client_max_body_size 0;
-
- include /etc/nginx/mime.types;
- default_type application/octet-stream;
-
- log_format main '[\$time_local] \$remote_addr -> '
- '"\$request" \$status \$body_bytes_sent '
- '"\$http_user_agent" \$request_time';
-
- access_log /data/nginx/logs/access.log main;
- error_log /data/nginx/logs/error.log;
-
- gzip on;
- gzip_types text/plain text/css application/json application/javascript;
-
- include /data/nginx/sites-enabled/*;
- }
- EOF
- # 3. Create the directory layout
- mkdir -p /data/nginx/{stream-sites-enabled,logs,sites-enabled,conf.d}
- chown -R www-data:www-data /data/nginx
2.3 MySQL tuning configuration
- # TCP load balancing for the K8s API Server (port 6443)
- cat > /data/nginx/stream-sites-enabled/k8s-apiserver.conf << EOF
- upstream k8s-apiserver {
- server 10.0.30.10:6443 max_fails=3 fail_timeout=30s;
- server 10.0.30.11:6443 max_fails=3 fail_timeout=30s;
- server 10.0.30.12:6443 max_fails=3 fail_timeout=30s;
- }
- server {
- listen 6443;
- proxy_pass k8s-apiserver;
- proxy_timeout 3s;
- proxy_connect_timeout 1s;
- }
- EOF
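2.4 Starting the middleware
- # HTTP load balancing across the Ingress nodes
- # (written under sites-enabled because nginx.conf only includes /data/nginx/sites-enabled/*)
- cat > /data/nginx/sites-enabled/k8s-ingress.conf << EOF
- upstream ingress_nodes {
- server 10.0.30.20:30080;
- server 10.0.30.21:30080;
- }
- EOF
- # Example application site configuration
- cat > /data/nginx/sites-enabled/app.conf << EOF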
- server {
- listen 80;
- server_name app.example.com;
-
- location / {
- proxy_pass http://ingress_nodes;
- proxy_set_header Host \$host;
- proxy_set_header X-Real-IP \$remote_addr;
- proxy_set_header X-Forwarded-For \$proxy_add_x_forwarded_for;
- }
- }
- EOF
- systemctl reload nginx
- #!/bin/bash
- # One-step JumpServer deployment
- # Quick deployment using the official script
- curl -sSL https://resource.fit2cloud.com/jumpserver/jumpserver/releases/download/v3.10.17/quick_start.sh | bash
- # Adjust the configuration (optional)
- # vim /opt/jumpserver/config/config.txt
- # Common settings:
- # - HTTP_PORT=80
- # - HTTPS_PORT=443
- # - DOMAINS="jumpserver.example.com"
- # Restart the service
- cd /opt/jumpserver-installer-v3.10.17
- ./jmsctl.sh restart
- #!/bin/bash
- # Install and configure Docker Engine
- # 1. Install dependencies
- apt-get update
- apt-get install -y ca-certificates curl gnupg lsb-release
- # 2. Add Docker's GPG key (via the Aliyun mirror)
- curl -fsSL https://mirrors.aliyun.com/docker-ce/linux/ubuntu/gpg | \
- gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
- # 3. Add the Docker repository
- echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] \
- https://mirrors.aliyun.com/docker-ce/linux/ubuntu $(lsb_release -cs) stable" | \
- tee /etc/apt/sources.list.d/docker.list
- # 4. Install Docker
- apt-get update
- apt-get install -y docker-ce docker-ce-cli containerd.io docker-compose-plugin
- # 5. Configure Docker
- # JSON has no comment syntax, so the notes live here instead of inside daemon.json:
- #   registry-mirrors: registry mirror (optional)
- #   data-root: Docker data directory
- mkdir -p /data/docker
- cat > /etc/docker/daemon.json << EOF
- {
- "log-driver": "json-file",
- "log-opts": {
- "max-size": "100m",
- "max-file": "3"
- },
- "registry-mirrors": [
- "https://docker.m.daocloud.io"
- ],
- "data-root": "/data/docker"
- }
- EOF
- # 6. Start Docker
- systemctl enable docker
- systemctl restart docker
- # 7. Verify the installation
- docker info
- docker compose version
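Docker only accepts strict JSON in /etc/docker/daemon.json, and JSON has no comment syntax, so a stray `#` or `//` line is enough to keep the daemon from starting. The helper below is a minimal sketch of a pre-restart check; the name `json_comment_lines` is illustrative, and the check is only a heuristic for comment lines, not a full JSON validator (`python3 -m json.tool` or `jq` can do that where they are installed).

```shell
# Print, with line numbers, every line that begins with a comment marker.
# Exit status follows grep: 0 if comment lines were found, 1 if none.
json_comment_lines() {
  grep -nE '^[[:space:]]*(#|//)' "$1"
}

# Typical use before restarting the daemon (shown as a comment, not executed here):
#   if json_comment_lines /etc/docker/daemon.json; then
#     echo "remove the comment lines above before restarting Docker" >&2
#   else
#     systemctl restart docker
#   fi
```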
5.2 Ingress routing configuration
- #!/bin/bash
- # Server security hardening
- # 1. Install fail2ban to block brute-force attempts
- apt-get install -y fail2ban
- # 2. Configure fail2ban
- cat > /etc/fail2ban/jail.local << EOF
- [DEFAULT]
- ignoreip = 127.0.0.1/8 ::1
- bantime = 3600
- maxretry = 3
- findtime = 600
- banaction = iptables-multiport
- [sshd]
- enabled = true
- port = 10022
- logpath = /var/log/auth.log
- maxretry = 3
- bantime = 3600
- EOF
- # 3. Start fail2ban
- systemctl enable fail2ban
- systemctl restart fail2ban
- # 4. Check the jail status
- fail2ban-client status sshd
5.3 Using ConfigMap and Secret
- # Disable swap immediately
- swapoff -a
- # Disable swap permanently: remove the swap line from fstab
- sed -i '/swap/d' /etc/fstab
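The `sed '/swap/d'` pattern removes every fstab line that contains the string "swap", which would also drop an unrelated mount whose path happens to include that word. As a more conservative sketch, the function below comments out only active entries whose filesystem type (the third field) is `swap` and prints the result to stdout so it can be reviewed first. The function name `comment_out_swap` is illustrative.

```shell
# Print an fstab-style file with active swap entries commented out.
# Only lines whose third field is exactly "swap" are touched.
comment_out_swap() {
  awk '$1 !~ /^#/ && $3 == "swap" { print "#" $0; next } { print }' "$1"
}

# Typical use (shown as a comment, not executed here):
#   comment_out_swap /etc/fstab > /tmp/fstab.new
#   diff /etc/fstab /tmp/fstab.new      # review the change
#   cat /tmp/fstab.new > /etc/fstab
```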
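Referencing them in a Deployment:
- # Kernel modules to load at boot (modules-load.d only treats whole lines starting with # or ; as comments, so the notes sit on their own lines)
- cat > /etc/modules-load.d/k8s.conf << EOF
- # OverlayFS filesystem
- overlay
- # Bridge netfilter
- br_netfilter
- EOF
- modprobe overlay
- modprobe br_netfilter
- # Verify
- lsmod | grep -E "overlay|br_netfilter"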
6. Monitoring and Logging
6.1 Prometheus + Grafana deployment
6.1.1 Deploying Node Exporter
- cat > /etc/sysctl.d/k8s.conf << EOF
- # Parameters required by K8s
- net.ipv4.ip_forward = 1
- net.bridge.bridge-nf-call-iptables = 1
- net.bridge.bridge-nf-call-ip6tables = 1
- # Connection tracking tuning
- net.netfilter.nf_conntrack_max = 524288
- # TCP tuning
- net.ipv4.tcp_keepalive_time = 600
- net.ipv4.tcp_keepalive_intvl = 30
- net.core.somaxconn = 32768
- # File descriptors
- fs.file-max = 2097152
- EOF
- sysctl --system
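Both /etc/sysctl.d/local.conf (from the base server script) and /etc/sysctl.d/k8s.conf set `net.core.somaxconn`, `fs.file-max`, and `net.ipv4.tcp_keepalive_time` to different values. As far as sysctl.d(5) describes it, `sysctl --system` applies files in lexicographic filename order and the last assignment wins, so here the values in local.conf would override those in k8s.conf. The sketch below lists keys that are defined in more than one file so such conflicts can be resolved deliberately; the function name `find_duplicate_sysctl_keys` is illustrative.

```shell
# Print every sysctl key that is assigned in more than one of the given files.
find_duplicate_sysctl_keys() {
  awk -F= '
    /^[ \t]*[#;]/ || /^[ \t]*$/ { next }    # skip comments and blank lines
    {
      key = $1
      gsub(/[ \t]/, "", key)                # strip whitespace around the key
      if (!((key, FILENAME) in seen)) {     # count each key once per file
        seen[key, FILENAME] = 1
        files[key]++
      }
    }
    END { for (k in files) if (files[k] > 1) print k }
  ' "$@" | sort
}

# Typical use (shown as a comment, not executed here):
#   find_duplicate_sysctl_keys /etc/sysctl.d/*.conf
```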
6.1.2 Prometheus configuration
- cat >> /etc/security/limits.conf << EOF
- # Kubernetes resource limits
- * soft nofile 655360
- * hard nofile 655360
- * soft nproc 655360
- * hard nproc 655360
- EOF
6.2 Example alerting rules
- # Add the Docker package repository (it also ships containerd)
- mkdir -p /etc/apt/keyrings
- curl -fsSL https://mirrors.aliyun.com/docker-ce/linux/ubuntu/gpg | \
- gpg --dearmor -o /etc/apt/keyrings/docker.gpg
- echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
- https://mirrors.aliyun.com/docker-ce/linux/ubuntu $(lsb_release -cs) stable" | \
- tee /etc/apt/sources.list.d/docker.list
- apt-get update
- apt-get install -y containerd.io
6.3 Log collection
Filebeat collects container logs and ships them to Elasticsearch:
- mkdir -p /etc/containerd
- cat > /etc/containerd/config.toml << 'EOF'
- version = 2
- [plugins."io.containerd.grpc.v1.cri"]
- # Use a mirror hosted in China for the pause image
- sandbox_image = "registry.aliyuncs.com/google_containers/pause:3.9"
-
- [plugins."io.containerd.grpc.v1.cri".containerd]
- [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
- runtime_type = "io.containerd.runc.v2"
- [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
- SystemdCgroup = true # use systemd as the cgroup driver
-
- [plugins."io.containerd.grpc.v1.cri".registry.mirrors]
- [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
- endpoint = ["https://docker.m.daocloud.io"]
- [plugins."io.containerd.grpc.v1.cri".registry.mirrors."registry.k8s.io"]
- endpoint = ["https://k8s.m.daocloud.io"]
- EOF
- systemctl daemon-reload
- systemctl restart containerd
- systemctl enable containerd
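7. Troubleshooting and Problem Resolution
7.1 Common problems quick reference
| Symptom | Likely cause | Diagnostic command | Fix |
| --- | --- | --- | --- |
| Pod stuck in Pending | Insufficient resources | kubectl describe pod | Add nodes or adjust resource requests |
| Pod CrashLoopBackOff | Application fails to start | kubectl logs | Check application configuration and dependencies |
| ImagePullBackOff | Image pull failed | kubectl describe pod | Check the image address and credentials |
| Service unreachable | Endpoints are empty | kubectl get endpoints | Check Pod labels and the selector |
| Ingress 502 | Backend Pod not ready | kubectl get pods | Check the readinessProbe |
| OOMKilled | Out of memory | kubectl describe pod | Raise the memory limit |
| Node NotReady | Network or kubelet problem | kubectl describe node | Check kubelet and the network plugin |
| DNS resolution fails | CoreDNS problem | kubectl logs -n kube-system -l k8s-app=kube-dns | Restart CoreDNS |
7.2 Diagnostic commands quick reference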
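- # Add the Aliyun Kubernetes repository
- mkdir -p /etc/apt/keyrings
- curl -fsSL https://mirrors.aliyun.com/kubernetes/apt/doc/apt-key.gpg | \
- gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
- # Double quotes are required here: inside single quotes the backslash-newline would be written to the file literally
- echo "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] \
- https://mirrors.aliyun.com/kubernetes/apt/ kubernetes-xenial main" | \
- tee /etc/apt/sources.list.d/kubernetes.list
- # Install a pinned version (the version string must match what the configured repository publishes; check with: apt-cache madison kubeadm)
- apt-get update
- apt-get install -y kubelet=1.28.6-1.1 kubeadm=1.28.6-1.1 kubectl=1.28.6-1.1
- # Hold the packages to prevent accidental upgrades
- apt-mark hold kubelet kubeadm kubectl
- # Enable kubelet
- systemctl enable kubelet
7.3 Typical case studies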
Case 1: Pods frequently OOMKilled
Symptom: the Pod restarts every few hours and its status shows OOMKilled
Investigation:
- apt-get install -y chrony
- cat > /etc/chrony/chrony.conf << 'EOF'
- server ntp.aliyun.com iburst
- server ntp.tencent.com iburst
- driftfile /var/lib/chrony/chrony.drift
- makestep 1.0 3
- rtcsync
- EOF
- systemctl restart chrony
- systemctl enable chrony
Cause: the JVM heap setting does not match the K8s memory limit
Fix:
- cat >> /etc/hosts << EOF
- 10.0.30.10 k8s-master-01
- 10.0.30.11 k8s-master-02
- 10.0.30.12 k8s-master-03
- 10.0.30.20 k8s-node-01
- 10.0.30.21 k8s-node-02
- 10.0.10.10 k8s-api-lb
- EOF
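For Case 1, the mismatch usually means the JVM heap is sized at or above the container memory limit, leaving no room for metaspace, thread stacks, and other non-heap memory. The sketch below works through the arithmetic under one common rule of thumb, which is an assumption and not a figure from the original article: give the heap about 75% of the container limit. The function name `jvm_heap_mb` is illustrative.

```shell
# Suggest a JVM max heap size (in MiB) for a given container memory limit.
# $1: container memory limit in MiB
# $2: percentage of the limit to give to the heap (default 75, an assumed rule of thumb)
jvm_heap_mb() {
  limit_mb=$1
  percent=${2:-75}
  echo $(( limit_mb * percent / 100 ))
}

# Example: a Pod with a 2048 MiB memory limit
echo "-Xmx$(jvm_heap_mb 2048)m"
```

On container-aware JVMs the same ratio can be expressed directly with `-XX:MaxRAMPercentage=75.0`, which avoids hard-coding a value that must be kept in sync with the Pod's limit.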
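Case 2: cross-node Pod traffic returns 504
Symptom: Pods on the same node communicate normally, but cross-node requests time out with 504; the failure rate is about 50%
Quick check:
- # Hot data disk (SSD): MySQL, Redis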
- mkdir -p /data/hot
- mount /dev/vdb1 /data/hot
- # Cold data disk (HDD): Elasticsearch, MinIO
- mkdir -p /data/cold
- mount /dev/vdc1 /data/cold
- # Add fstab entries so the disks mount automatically
- echo '/dev/vdb1 /data/hot ext4 defaults 0 0' >> /etc/fstab
- echo '/dev/vdc1 /data/cold ext4 defaults 0 0' >> /etc/fstab
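Root cause: the cloud platform's security group allowed only the node network, not the Pod network CIDR
Fix: add these rules to the security group:
- Inbound: ANY - Pod network CIDR (e.g. 172.16.0.0/16)
- Outbound: ANY - Pod network CIDR (e.g. 172.16.0.0/16)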
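The roughly 50% failure rate is what one would expect from this root cause on a two-worker cluster. Assuming backend Pods are spread evenly across the workers and the Service picks a backend uniformly at random (both are simplifying assumptions, not statements from the original article), a request lands on a different node with probability (n-1)/n. The sketch below computes that share; the function name `cross_node_percent` is illustrative.

```shell
# Percentage of requests expected to cross nodes, given the number of worker
# nodes, under the assumption of evenly spread backends and uniform selection.
cross_node_percent() {
  nodes=$1
  echo $(( 100 * (nodes - 1) / nodes ))
}

# Two workers, as in this cluster's server inventory
echo "expected cross-node share: $(cross_node_percent 2)%"
```

A failure rate that tracks this share is a useful hint that only cross-node traffic is being dropped, which points at node-to-node filtering rather than at the application.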
For the detailed case analysis, see the "Troubleshooting in Practice" installment (《故障排查实战》)
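8. Summary and Checklists
8.1 Pre-deployment checklist
| Check | Command/Action | Expected result |
| --- | --- | --- |
| System time synchronized | timedatectl | System clock synchronized: yes |
| Swap disabled | free -h | Swap row is all zeros |
| Kernel modules loaded | lsmod \| grep br_netfilter | Produces output |
| containerd running | systemctl status containerd | active (running) |
| kubelet enabled | systemctl is-enabled kubelet | enabled |
| Network connectivity | Ping test between nodes | All reachable |
| Image registry reachable | crictl pull nginx | Pull succeeds |
8.2 Post-deployment verification checklist
| Check | Command | Expected result |
| --- | --- | --- |
| All nodes Ready | kubectl get nodes | STATUS is Ready for every node |
| System Pods healthy | kubectl get pods -n kube-system | All Running |
| Network plugin healthy | kubectl get pods -n calico-system | All Running |
| DNS resolution works | kubectl run test --rm -it --image=busybox -- nslookup kubernetes | Resolves successfully |
| Cross-node communication | Create two Pods and ping each from the other | Traffic flows |
| Ingress works | Create a test Ingress and access it | Normal response |
8.3 Common commands quick reference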
- # docker-compose.yml
- version: '3'
- services:
- mysql:
- image: mysql:8.0
- restart: always
- ports:
- - 3306:3306
- volumes:
- - /data/hot/mysql:/var/lib/mysql
- - ./config/my.cnf:/etc/mysql/conf.d/my.cnf
- environment:
- MYSQL_ROOT_PASSWORD: ${MYSQL_PASSWORD}
- TZ: Asia/Shanghai
- networks:
- - middleware
- redis:
- image: redis:7.2
- restart: always
- ports:
- - 6379:6379
- volumes:
- - /data/hot/redis:/data
- command: redis-server --requirepass ${REDIS_PASSWORD} --appendonly yes
- networks:
- - middleware
- nacos:
- image: nacos/nacos-server:v2.3.2
- restart: always
- depends_on:
- - mysql
- environment:
- MODE: standalone
- NACOS_AUTH_ENABLE: "true"
- SPRING_DATASOURCE_PLATFORM: mysql
- MYSQL_SERVICE_HOST: mysql
- MYSQL_SERVICE_DB_NAME: nacos
- MYSQL_SERVICE_USER: root
- MYSQL_SERVICE_PASSWORD: ${MYSQL_PASSWORD}
- ports:
- - 8848:8848
- - 9848:9848
- networks:
- - middleware
- rabbitmq:
- image: rabbitmq:3.12-management
- restart: always
- ports:
- - 5672:5672
- - 15672:15672
- environment:
- RABBITMQ_DEFAULT_USER: admin
- RABBITMQ_DEFAULT_PASS: ${RABBITMQ_PASSWORD}
- volumes:
- - /data/hot/rabbitmq:/var/lib/rabbitmq
- networks:
- - middleware
- elasticsearch:
- image: elasticsearch:7.17.19
- restart: always
- volumes:
- - /data/cold/elasticsearch:/usr/share/elasticsearch/data
- environment:
- discovery.type: single-node
- ES_JAVA_OPTS: -Xms2g -Xmx2g
- ports:
- - 9200:9200
- networks:
- - middleware
- networks:
- middleware:
- driver: bridge
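The Compose file reads `MYSQL_PASSWORD`, `REDIS_PASSWORD`, and `RABBITMQ_PASSWORD` from the environment, typically from a `.env` file next to docker-compose.yml. If one is missing, Compose substitutes an empty string and the affected service starts with an empty password or fails outright. The sketch below reports which required variables are absent or empty in an env file before `docker compose up -d` is run; the function name `missing_env_vars` is illustrative.

```shell
# Print the name of every required variable that is missing or empty in an env file.
# $1: path to the env file; remaining arguments: required variable names
missing_env_vars() {
  env_file=$1
  shift
  for var in "$@"; do
    grep -Eq "^${var}=.+" "$env_file" || echo "$var"
  done
}

# Typical use (shown as a comment, not executed here):
#   missing=$(missing_env_vars .env MYSQL_PASSWORD REDIS_PASSWORD RABBITMQ_PASSWORD)
#   [ -z "$missing" ] && docker compose up -d || echo "set these first: $missing" >&2
```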
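8.4 Key configuration file locations
| Item | Path |
| --- | --- |
| kubeadm configuration | /etc/kubernetes/admin.conf |
| kubelet configuration | /var/lib/kubelet/config.yaml |
| containerd configuration | /etc/containerd/config.toml |
| Calico configuration | kubectl get installation default -o yaml |
| kubectl configuration | ~/.kube/config |
Keywords: Kubernetes, deployment practice, CI/CD, monitoring, troubleshooting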