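Infrastructure as Code (IaC) Workflow

Claude Code Workflow Series · Managing Infrastructure with Code

Series: A Systematic Guide to Claude Code Workflows

Keywords: Claude Code, IaC, Terraform, Pulumi, Ansible, GitOps, ArgoCD, Terraform State, drift detection

1. Choosing and Comparing IaC Frameworks

Infrastructure as Code (IaC) is a core pillar of DevOps practice: it brings the configuration and management of infrastructure in line with the software development process. The choice of framework shapes both team productivity and the maintainability of the infrastructure. The mainstream options today are declarative provisioning tools (Terraform, OpenTofu, CloudFormation, Pulumi), configuration management tools (Ansible), and image builders (Packer).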

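Comparison of mainstream IaC frameworks

| Framework | Style | Language | State management | Multi-cloud | Typical use |
|---|---|---|---|---|---|
| Terraform | Declarative | HCL | Yes (state file) | Yes | Infrastructure resource orchestration |
| OpenTofu | Declarative | HCL | Yes (state file) | Yes | Open-source alternative to Terraform |
| Pulumi | Declarative/imperative | TS/Python/Go/C# | Yes (state) | Yes | Managing infrastructure in a general-purpose language |
| CloudFormation | Declarative | YAML/JSON | Yes (managed by AWS) | No (AWS only) | AWS-specific resource orchestration |
| CDK | Imperative | TS/Python/Java/C#/Go | Yes (via CloudFormation) | No (AWS only) | Programming AWS infrastructure |
| Ansible | Declarative (procedural execution) | YAML | None (relies on idempotent modules) | Yes | Configuration management and application deployment |
| Packer | Declarative | HCL/JSON | None | Yes | Building uniform machine images |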

Basic Terraform HCL example

terraform {
  required_version = "~> 1.6"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.23"
    }
  }

  # Remote state in S3, with a DynamoDB table for state locking
  backend "s3" {
    bucket         = "my-company-terraform-state"
    key            = "infrastructure/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}

Pulumi TypeScript example

import * as aws from "@pulumi/aws";
import * as awsx from "@pulumi/awsx";

// Create the VPC.
// vpc.vpcId below is the awsx 1.x+ API, where the NAT gateway count is
// expressed as a strategy rather than the older numberOfNatGateways option.
const vpc = new awsx.ec2.Vpc("main-vpc", {
  cidrBlock: "10.0.0.0/16",
  numberOfAvailabilityZones: 3,
  natGateways: { strategy: awsx.ec2.NatGatewayStrategy.Single },
  tags: { Name: "main-vpc", Environment: "production" },
});

// Create the ECS cluster that Fargate services will run in
const cluster = new aws.ecs.Cluster("app-cluster", {
  name: "app-cluster",
  tags: { Environment: "production" },
});

export const vpcId = vpc.vpcId;
export const publicSubnetIds = vpc.publicSubnetIds;

Ansible playbook example

---
- name: Provision web servers
  hosts: webservers
  become: true
  vars:
    app_port: 8080
    app_version: "1.2.3"

  tasks:
    - name: Install Nginx
      apt:
        name: nginx
        state: present
        update_cache: true

    - name: Configure Nginx reverse proxy
      template:
        src: nginx.conf.j2
        dest: /etc/nginx/sites-available/app
      notify: restart nginx

    - name: Deploy application container
      docker_container:
        name: app
        image: "myregistry/app:{{ app_version }}"
        # ports takes a list of host:container mappings
        ports:
          - "{{ app_port }}:8080"
        state: started
        restart_policy: always

  handlers:
    - name: restart nginx
      service:
        name: nginx
        state: restarted

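Choosing a framework: for multi-cloud resource orchestration, use Terraform or OpenTofu; when strong type checking matters, Pulumi with TypeScript or Go is a good fit; for configuration management and application deployment, use Ansible; in AWS-only environments, CDK offers the smoothest developer experience.

2. Defining and Managing Resources

The core capability of IaC is defining cloud resources in code: networking, compute, storage, databases, security policies, and so on. Resource definitions should follow three principles: modular, parameterized, and reusable. With a modular design, a team can split its infrastructure into logical units that are managed independently, each module exposing a clear input and output interface.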

VPC network topology

# VPC, public subnets, and a web security group

# Availability zones referenced by the subnets below
data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
  tags                 = { Name = "main-vpc", Environment = var.environment }
}

resource "aws_subnet" "public" {
  count  = 3
  vpc_id = aws_vpc.main.id
  # Carve one /24 per subnet out of the VPC's /16
  cidr_block              = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index)
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = true
  tags                    = { Name = "public-${count.index}", Tier = "public" }
}

resource "aws_security_group" "web_sg" {
  name        = "web-security-group"
  description = "Allow HTTP/HTTPS ingress"
  vpc_id      = aws_vpc.main.id

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # Allow all outbound traffic
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = { Name = "web-sg" }
}
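To make the cidrsubnet(aws_vpc.main.cidr_block, 8, count.index) call above concrete, here is a minimal TypeScript sketch of the same arithmetic for IPv4. The helper name cidrSubnet is hypothetical and written only for illustration; it is not Terraform's implementation, and it assumes the input prefix is already a network address.

```typescript
// Illustrative IPv4-only model of Terraform's cidrsubnet(prefix, newbits, netnum).
// Assumes `prefix` is a valid, network-aligned CIDR such as "10.0.0.0/16".
function cidrSubnet(prefix: string, newBits: number, netNum: number): string {
  const parts = prefix.split("/");
  const base = parts[0] ?? "";
  const newLen = Number(parts[1]) + newBits;
  if (newLen > 32) throw new Error("not enough address bits left");
  if (netNum < 0 || netNum >= Math.pow(2, newBits)) throw new Error("netNum out of range");

  // Pack the dotted quad into one number; plain arithmetic avoids signed 32-bit bit ops.
  const baseInt = base.split(".").reduce((acc, octet) => acc * 256 + Number(octet), 0);
  // Each new subnet spans 2^(32 - newLen) addresses.
  const subnetInt = baseInt + netNum * Math.pow(2, 32 - newLen);

  const octets = [3, 2, 1, 0].map((i) => Math.floor(subnetInt / Math.pow(256, i)) % 256);
  return `${octets.join(".")}/${newLen}`;
}

// The three public subnets produced by count = 3:
const publicCidrs = [0, 1, 2].map((index) => cidrSubnet("10.0.0.0/16", 8, index));
console.log(publicCidrs); // [ '10.0.0.0/24', '10.0.1.0/24', '10.0.2.0/24' ]
```

Adding 8 bits to a /16 yields /24 networks, so each count.index selects the next 256-address block.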

EC2 and RDS resources

# EC2 instance and RDS database
resource "aws_instance" "app_server" {
  ami                    = data.aws_ami.amazon_linux_2.id
  instance_type          = var.instance_type
  subnet_id              = aws_subnet.public[0].id
  vpc_security_group_ids = [aws_security_group.web_sg.id]
  iam_instance_profile   = aws_iam_instance_profile.app.name

  # Bootstrap script: install and start Docker
  user_data = <<-EOF
    #!/bin/bash
    yum update -y
    yum install -y docker
    systemctl enable --now docker
  EOF

  root_block_device {
    volume_type = "gp3"
    volume_size = 100
    encrypted   = true
  }

  tags = { Name = "app-server" }
}

resource "aws_db_instance" "postgres" {
  identifier        = "app-database"
  engine            = "postgres"
  engine_version    = "15.3"
  instance_class    = "db.r6g.large"
  allocated_storage = 100 # GiB; required when creating a new instance
  db_name           = "appdb"
  # "admin" is a reserved word for the PostgreSQL engine on RDS
  username = "appadmin"
  password = random_password.db_password.result

  vpc_security_group_ids = [aws_security_group.database_sg.id]
  db_subnet_group_name   = aws_db_subnet_group.private.name
  storage_encrypted      = true

  backup_retention_period = 30
  backup_window           = "03:00-04:00"
  maintenance_window      = "sun:05:00-sun:06:00"

  deletion_protection = true
  skip_final_snapshot = false
  # Required whenever skip_final_snapshot is false
  final_snapshot_identifier = "app-database-final"

  tags = { Name = "app-postgres" }
}

Lambda serverless function

# Lambda function plus the permission that lets API Gateway invoke it
resource "aws_lambda_function" "api_handler" {
  filename      = "handler.zip"
  function_name = "api-handler"
  role          = aws_iam_role.lambda_exec.arn
  handler       = "index.handler"
  runtime       = "nodejs18.x"
  memory_size   = 256
  timeout       = 30
  publish       = true

  environment {
    variables = {
      DB_HOST   = aws_db_instance.postgres.address
      DB_PORT   = aws_db_instance.postgres.port
      DB_NAME   = aws_db_instance.postgres.db_name
      S3_BUCKET = aws_s3_bucket.assets.id
    }
  }

  # Enable X-Ray tracing
  tracing_config {
    mode = "Active"
  }

  tags = { Service = "api" }
}

resource "aws_lambda_permission" "api_gw" {
  statement_id  = "AllowAPIGatewayInvoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.api_handler.function_name
  principal     = "apigateway.amazonaws.com"
  source_arn    = "${aws_api_gateway_rest_api.api.execution_arn}/*/*"
}

S3 bucket and load balancer

# S3 bucket for front-end static assets
resource "aws_s3_bucket" "assets" {
  bucket = "mycompany-assets-${var.environment}"
  tags   = { Name = "assets" }
}

resource "aws_s3_bucket_versioning" "assets" {
  bucket = aws_s3_bucket.assets.id
  versioning_configuration {
    status = "Enabled"
  }
}

# Block every form of public access to the bucket
resource "aws_s3_bucket_public_access_block" "assets" {
  bucket                  = aws_s3_bucket.assets.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Application load balancer
resource "aws_lb" "web" {
  name                       = "web-alb"
  internal                   = false
  load_balancer_type         = "application"
  security_groups            = [aws_security_group.alb_sg.id]
  subnets                    = aws_subnet.public[*].id
  enable_deletion_protection = true
  tags                       = { Name = "web-alb" }
}

resource "aws_lb_target_group" "web" {
  name     = "web-tg"
  port     = 80
  protocol = "HTTP"
  vpc_id   = aws_vpc.main.id

  health_check {
    path                = "/health"
    interval            = 30
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 3
  }
}

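Resource dependencies and passing values between resources

Infrastructure resources naturally depend on one another. IaC tools work out the execution order from implicit references and explicit depends_on declarations. Terraform analyzes every reference between resources and builds a dependency graph automatically, but a dependency that is not visible in any attribute reference has to be declared explicitly (VPC peering is one such case). Managing dependencies correctly is key to avoiding failed deployments.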

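# Implicit dependency: referencing another resource creates the dependency
resource "aws_eip" "nat" {
  domain   = "vpc"
  instance = aws_instance.nat.id # implicit dependency on aws_instance.nat
}

# Explicit dependency: for ordering that no attribute reference expresses.
# aws_vpc.main and aws_vpc.other are already implied by the references below,
# so only the subnets need to be listed. auto_accept works only when both
# VPCs are in the same account and region.
resource "aws_vpc_peering_connection" "peer" {
  vpc_id      = aws_vpc.main.id
  peer_vpc_id = aws_vpc.other.id
  auto_accept = true
  depends_on  = [aws_subnet.public]
}

# Data source: read information about an existing resource
data "aws_ami" "amazon_linux_2" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }
}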
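How a dependency graph turns into an execution order can be sketched in a few lines. The following is a simplified model, assuming every resource address appears as a key in the map; the function name creationOrder and the sample addresses are illustrative, and Terraform's real graph walk additionally runs independent resources in parallel.

```typescript
// Maps each resource address to the addresses it depends on.
type DependencyMap = Record<string, string[]>;

// Returns one valid creation order: a resource is created only after
// everything it depends on already exists (a basic topological sort).
function creationOrder(deps: DependencyMap): string[] {
  const created: string[] = [];
  let pending = Object.keys(deps);

  while (pending.length > 0) {
    // Resources whose dependencies have all been created in earlier rounds.
    const ready = pending
      .filter((address) => (deps[address] ?? []).every((need) => created.includes(need)))
      .sort();
    if (ready.length === 0) {
      throw new Error("dependency cycle or unknown dependency");
    }
    created.push(...ready);
    pending = pending.filter((address) => !ready.includes(address));
  }
  return created;
}

const order = creationOrder({
  "aws_instance.app_server": ["aws_subnet.public", "aws_security_group.web_sg"],
  "aws_subnet.public": ["aws_vpc.main"],
  "aws_security_group.web_sg": ["aws_vpc.main"],
  "aws_vpc.main": [],
});
console.log(order);
```

Resources that become ready in the same round (here the subnet and the security group) have no ordering constraint between them, which is exactly where a real tool can work concurrently.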

Kubernetes cluster and workloads

# EKS cluster
resource "aws_eks_cluster" "main" {
  name     = "main-eks"
  role_arn = aws_iam_role.eks_cluster.arn
  version  = "1.27"

  # Private-only API endpoint
  vpc_config {
    subnet_ids              = aws_subnet.private[*].id
    endpoint_private_access = true
    endpoint_public_access  = false
  }

  enabled_cluster_log_types = ["api", "audit", "authenticator"]
}

# Managed node group
resource "aws_eks_node_group" "general" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "general"
  node_role_arn   = aws_iam_role.eks_nodes.arn
  subnet_ids      = aws_subnet.private[*].id
  instance_types  = ["r6i.large"]

  scaling_config {
    desired_size = 3
    min_size     = 1
    max_size     = 10
  }

  update_config {
    max_unavailable = 1
  }

  tags = { Cluster = "main-eks" }
}

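3. Configuration Management

Configuration management is where an IaC workflow handles the differences between applications and environments. It spans environment variables, sensitive values (secrets), and dynamic runtime configuration. A good setup supports per-environment configuration, version history for configuration, encrypted storage for secrets, and live changes to runtime configuration.

Environment variables and configuration files

Infrastructure and application configuration usually differ between environments (development, testing, staging, production). Terraform separates environments through variable files (.tfvars) and environment variables, which avoids hard-coding.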

# Development environment: dev.tfvars
environment   = "dev"
instance_type = "t3.medium"
db_instance   = "db.t4g.small"
min_capacity  = 1
max_capacity  = 5
enable_https  = false
log_retention = 7

# Production environment: prod.tfvars
environment   = "prod"
instance_type = "r6i.xlarge"
db_instance   = "db.r6g.large"
min_capacity  = 3
max_capacity  = 20
enable_https  = true
log_retention = 365

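Environment isolation best practices: keep one .tfvars file per environment; inject secrets through environment variables in the CI/CD pipeline; keep any file that contains secrets out of Git with .gitignore and distribute it through a secure channel.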

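Secrets management and AWS Parameter Store

# Store sensitive values in AWS Secrets Manager
resource "aws_secretsmanager_secret" "db_credentials" {
  name                    = "db-credentials-${var.environment}"
  recovery_window_in_days = 7
}

# With AWS provider 5.x, rotation is configured in a separate resource.
# aws_lambda_function.rotate_db_secret is a placeholder for a rotation
# function defined elsewhere.
resource "aws_secretsmanager_secret_rotation" "db_credentials" {
  secret_id           = aws_secretsmanager_secret.db_credentials.id
  rotation_lambda_arn = aws_lambda_function.rotate_db_secret.arn

  rotation_rules {
    automatically_after_days = 90
  }
}

resource "aws_secretsmanager_secret_version" "db_credentials" {
  secret_id = aws_secretsmanager_secret.db_credentials.id
  secret_string = jsonencode({
    username = aws_db_instance.postgres.username
    password = random_password.db_password.result
    host     = aws_db_instance.postgres.address
    port     = aws_db_instance.postgres.port
  })
}

# Store application configuration in AWS SSM Parameter Store
resource "aws_ssm_parameter" "app_config" {
  name = "/app/${var.environment}/config"
  type = "SecureString"
  value = jsonencode({
    feature_flags     = { new_checkout = true, dark_mode = false }
    api_rate_limit    = 1000
    cache_ttl_seconds = 300
    external_services = {
      payment_gateway_url = "https://pay.example.com"
      webhook_url         = "https://hooks.example.com/events"
    }
  })
}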

Dynamic configuration with AppConfig

Tools such as AWS AppConfig and LaunchDarkly deliver configuration at runtime, so feature flags, rate-limit thresholds, and similar parameters can be adjusted without a redeploy. This matters most for emergency changes and canary releases.

# AWS AppConfig dynamic configuration resources
resource "aws_appconfig_application" "app" {
  name = "my-app"
}

resource "aws_appconfig_configuration_profile" "feature_flags" {
  application_id = aws_appconfig_application.app.id
  name           = "feature-flags"
  location_uri   = "hosted"
  type           = "AWS.AppConfig.FeatureFlags"

  validator {
    type = "JSON_SCHEMA"
    content = jsonencode({
      "$schema" = "http://json-schema.org/draft-07/schema#"
      type      = "object"
    })
  }
}

resource "aws_appconfig_environment" "production" {
  application_id = aws_appconfig_application.app.id
  name           = "production"

  # Roll back a deployment automatically when this alarm fires
  monitors {
    alarm_arn      = aws_cloudwatch_metric_alarm.app_error.arn
    alarm_role_arn = aws_iam_role.appconfig.arn
  }
}

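Configuration version control and auditing

All configuration files should live in a Git repository; together with Terraform plan output, this forms a complete audit trail of changes. Every configuration change should produce a plan report in the CI/CD pipeline, and that report should be archived. Managing configuration baselines with Git tags and version numbers makes it possible to roll back quickly to a known-good configuration during disaster recovery.

Core principles of configuration management: (1) keep code and configuration separate, and never hard-code sensitive values; (2) express environment differences through injected variables, not through branches or copied code; (3) configuration changes go through review and approval; (4) runtime configuration takes precedence over deploy-time configuration, but the permission to change it must be tightly controlled.

4. State Management

Terraform state is one of the most important concepts in IaC. The state file records the metadata and dependencies of the infrastructure that has actually been deployed, and it is what Terraform uses to determine the difference between the real world and the desired configuration. Without sound state management, teams run into resource drift, concurrent-write conflicts, and data loss.

Terraform state in detail

Terraform state maps the resources declared in configuration files to the real resource IDs at the cloud provider. Every resource block gets a unique resource address, and the state stores the resource's attribute values, dependencies, and metadata (such as the provider configuration). State is stored as JSON; it can be read directly, but editing it by hand is not recommended.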

# List the resources tracked in the current state
terraform state list
# Example output:
#   data.aws_ami.amazon_linux_2
#   aws_vpc.main
#   aws_subnet.public[0]
#   aws_subnet.public[1]
#   aws_subnet.public[2]
#   aws_security_group.web_sg
#   aws_instance.app_server
#   aws_db_instance.postgres
#   aws_lb.web

# Show the recorded attributes of one resource
terraform state show aws_instance.app_server

# Remove a resource from state (the real resource is left untouched)
terraform state rm aws_instance.app_server

# Import an existing resource into state (the instance ID is a placeholder)
terraform import aws_instance.app_server i-0abcd1234ef567890
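The role state plays in planning can be illustrated with a small model: compare the addresses and attributes in the desired configuration with those recorded in state, and derive an action per resource. This is a conceptual sketch only; the names planActions and sameAttributes and the sample data are assumptions, and real Terraform also refreshes state from the provider, handles computed attributes, and distinguishes in-place updates from replacements.

```typescript
type Attributes = Record<string, string | number | boolean>;
type Action = "create" | "update" | "delete" | "no-op";

// Shallow attribute comparison, independent of key order.
function sameAttributes(a: Attributes, b: Attributes): boolean {
  const keys = Object.keys(a);
  return keys.length === Object.keys(b).length && keys.every((key) => a[key] === b[key]);
}

// Derive a per-resource action from desired configuration vs. recorded state.
function planActions(
  desired: Record<string, Attributes>,
  state: Record<string, Attributes>,
): Record<string, Action> {
  const actions: Record<string, Action> = {};
  for (const [address, want] of Object.entries(desired)) {
    const have = state[address];
    if (have === undefined) {
      actions[address] = "create"; // declared but not yet tracked
    } else {
      actions[address] = sameAttributes(want, have) ? "no-op" : "update";
    }
  }
  for (const address of Object.keys(state)) {
    if (!(address in desired)) actions[address] = "delete"; // tracked but no longer declared
  }
  return actions;
}

const actions = planActions(
  {
    "aws_vpc.main": { cidr_block: "10.0.0.0/16" },
    "aws_instance.app_server": { instance_type: "t3.medium" },
    "aws_lb.web": { internal: false },
  },
  {
    "aws_vpc.main": { cidr_block: "10.0.0.0/16" },
    "aws_instance.app_server": { instance_type: "t3.small" },
    "aws_s3_bucket.old": { bucket: "legacy" },
  },
);
console.log(actions);
```

The sketch also shows why a lost state file is so disruptive: with an empty state every declared resource looks like a create, even though the real resources still exist.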

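Remote state storage and state locking

Production environments must keep state in a remote backend to avoid the risk of losing a local file. Common backends include AWS S3 with DynamoDB, Azure Storage, GCS, and Terraform Cloud. A remote backend combined with a locking mechanism (such as a DynamoDB table) prevents concurrent operations from corrupting the state.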

# S3 backend with DynamoDB state locking
terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "network/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

# Lock table definition (create it manually first, or from a configuration
# that uses a different backend)
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }

  server_side_encryption {
    enabled = true
  }
}
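The locking idea itself is simple: a lock is acquired by writing an item keyed by LockID only if no such item exists yet, and released by deleting it. Below is a minimal in-memory sketch of that behaviour. The LockTable class and the holder names are hypothetical stand-ins for the DynamoDB table and the Terraform runs; no AWS API is called.

```typescript
// In-memory stand-in for a lock table with "write only if absent" semantics.
class LockTable {
  private readonly locks = new Map<string, string>(); // LockID -> holder

  // Succeeds only when nobody currently holds the lock.
  acquire(lockId: string, holder: string): boolean {
    if (this.locks.has(lockId)) return false;
    this.locks.set(lockId, holder);
    return true;
  }

  // Only the current holder may release the lock.
  release(lockId: string, holder: string): boolean {
    if (this.locks.get(lockId) !== holder) return false;
    this.locks.delete(lockId);
    return true;
  }
}

const table = new LockTable();
const lockId = "company-terraform-state/network/terraform.tfstate";

const ciRunGotLock = table.acquire(lockId, "ci-run");      // first writer wins
const laptopGotLock = table.acquire(lockId, "laptop-run"); // concurrent run is refused
table.release(lockId, "ci-run");
const laptopRetryGotLock = table.acquire(lockId, "laptop-run");
console.log(ciRunGotLock, laptopGotLock, laptopRetryGotLock);
```

The second run has to wait or fail rather than write state at the same time, which is the property that protects the state file from concurrent applies.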

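State migration

Teams need to migrate state when they refactor or switch backends. `terraform state mv` renames and restructures resource addresses inside the state, and `terraform state pull` / `terraform state push` copy state between a local file and the backend. To move from local state to a remote backend, change the backend configuration and run `terraform init -migrate-state`. The `-reconfigure` flag does the opposite (it initializes the new backend without migrating existing state), so the two flags cannot be combined.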

# State migration: S3 → S3 (different bucket)
# 1. Point the backend configuration at the new bucket
# 2. Run init with state migration
terraform init -migrate-state

# State migration: renaming a resource (during a module refactor)
# Old address: module.legacy.aws_vpc.main
# New address: module.network.aws_vpc.main
terraform state mv module.legacy.aws_vpc.main module.network.aws_vpc.main

# Moving a whole module: preview first, then move
terraform state mv -dry-run 'module.legacy' 'module.network'
terraform state mv 'module.legacy' 'module.network'
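Conceptually, moving a module rewrites the address prefix of every resource recorded under it while leaving the real infrastructure alone. The sketch below models only that renaming over a list of addresses; moveModule is a hypothetical helper for illustration and does not read or write a real state file.

```typescript
// Rewrite every address under `from` so that it lives under `to`.
// The "." boundary check keeps similarly named modules untouched.
function moveModule(addresses: string[], from: string, to: string): string[] {
  return addresses.map((address) =>
    address === from || address.startsWith(from + ".")
      ? to + address.slice(from.length)
      : address,
  );
}

const moved = moveModule(
  [
    "module.legacy.aws_vpc.main",
    "module.legacy.aws_subnet.public[0]",
    "module.legacy_v2.aws_vpc.main", // different module, must not be renamed
    "aws_s3_bucket.assets",
  ],
  "module.legacy",
  "module.network",
);
console.log(moved);
```

Because only addresses change, a plan run after a correct move should report no infrastructure changes, which is a useful check after any refactor.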

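State recovery and disaster response

Important: a lost state file can be rebuilt by importing resources one at a time with terraform import, but the process is extremely tedious and may lose the dependencies between resources. Enable versioning on the S3 bucket so that every state file has multiple backup versions, and add a lifecycle rule that cleans up old versions automatically.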

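# Enable S3 bucket versioning to protect the state file
resource "aws_s3_bucket_versioning" "state_bucket" {
  bucket = aws_s3_bucket.terraform_state.id
  versioning_configuration {
    status = "Enabled"
  }
}

# Restore state from an earlier S3 version
# 1. List all versions of the state file
aws s3api list-object-versions \
  --bucket company-terraform-state \
  --prefix network/terraform.tfstate

# 2. Download the chosen version ("VersionId" is a placeholder)
aws s3api get-object \
  --bucket company-terraform-state \
  --key network/terraform.tfstate \
  --version-id "VersionId" \
  terraform.tfstate.recovered

# 3. Push it to the current backend. Terraform refuses the push when the
#    remote state has a higher serial or a different lineage; override with
#    -force only after confirming the recovered file is the one you want.
terraform state push terraform.tfstate.recovered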

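Importing existing resources

Bringing manually created resources under IaC management is a common migration scenario. The `terraform import` command maps an existing resource to a resource block in the configuration. After importing, run `terraform plan` to confirm that the configuration matches the real resource.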

# Define the matching resource block in the configuration
resource "aws_s3_bucket" "existing_data" {
  bucket = "company-existing-data-bucket"
}

# Import command
terraform import aws_s3_bucket.existing_data company-existing-data-bucket

# Run plan to see attribute differences, then adjust the configuration to match
terraform plan

# Alternative: declare imports in configuration with import blocks
# (Terraform 1.5+), which suits bulk imports
import {
  to = aws_s3_bucket.existing_data
  id = "company-existing-data-bucket"
}

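5. CI/CD for Infrastructure

The CI/CD pipeline for infrastructure is the core of IaC workflow automation. Unlike application CI/CD, an infrastructure pipeline has to handle the two-phase plan/apply cycle, approval steps, state file security, and drift detection. GitOps is an advanced deployment model for infrastructure: the Git repository is the single source of truth, and automated operators manage the infrastructure declaratively.

GitOps workflow architecture

The central idea of GitOps is that the Git repository holds the desired state of the infrastructure, while an automated operator continuously checks the real environment and keeps it consistent with that desired state. This removes manual mistakes and provides a complete audit trail of changes.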

# GitOps workflow example: GitHub Actions + Terraform
name: "Terraform Infrastructure Pipeline"

on:
  push:
    branches: [main]
    paths:
      - "terraform/**"
  pull_request:
    branches: [main]
    paths:
      - "terraform/**"

permissions:
  contents: read
  pull-requests: write # needed to post the plan as a PR comment

env:
  TF_VERSION: "1.6.0"
  TF_WORKING_DIR: "terraform"

jobs:
  terraform_plan:
    name: "Terraform Plan"
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: "Setup Terraform"
        uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: ${{ env.TF_VERSION }}
      - name: "Terraform Init"
        run: terraform init
        working-directory: ${{ env.TF_WORKING_DIR }}
      - name: "Terraform Validate"
        run: terraform validate -no-color
        working-directory: ${{ env.TF_WORKING_DIR }}
      - name: "Terraform Plan"
        id: plan
        run: terraform plan -no-color -out=tfplan
        working-directory: ${{ env.TF_WORKING_DIR }}
      # The apply job runs on a fresh runner, so hand the saved plan over as
      # an artifact. A plan file can contain sensitive values.
      - name: "Upload Plan"
        uses: actions/upload-artifact@v4
        with:
          name: tfplan
          path: ${{ env.TF_WORKING_DIR }}/tfplan
      - name: "Post Plan Comment"
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v6
        env:
          # Pass the plan through the environment instead of interpolating it
          # into the script source
          PLAN: ${{ steps.plan.outputs.stdout }}
        with:
          script: |
            const fence = "`".repeat(3);
            const output = `## Terraform Plan\n${fence}\n${process.env.PLAN}\n${fence}`;
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: output
            });

  terraform_apply:
    name: "Terraform Apply"
    needs: [terraform_plan]
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
      - name: "Setup Terraform"
        uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: ${{ env.TF_VERSION }}
      - name: "Terraform Init"
        run: terraform init
        working-directory: ${{ env.TF_WORKING_DIR }}
      - name: "Download Plan"
        uses: actions/download-artifact@v4
        with:
          name: tfplan
          path: ${{ env.TF_WORKING_DIR }}
      - name: "Terraform Apply"
        run: terraform apply -auto-approve tfplan
        working-directory: ${{ env.TF_WORKING_DIR }}

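ArgoCD and Kubernetes GitOps

ArgoCD is one of the most popular GitOps tools in the Kubernetes ecosystem. It continuously watches the application definitions in a Git repository (Helm charts, Kustomize, or plain YAML) and syncs them to the target cluster. When the configuration in Git changes, ArgoCD automatically brings the cluster to the desired state.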

# ArgoCD Application definition
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production-infra
  namespace: argocd
spec:
  project: default
  source:
    repoURL: "https://github.com/company/gitops-infra.git"
    targetRevision: main
    path: kubernetes/overlays/production
    # The helm block applies only when the path contains a Helm chart;
    # omit it when the path is a Kustomize overlay or plain YAML.
    helm:
      valueFiles:
        - values-prod.yaml
  destination:
    server: "https://kubernetes.default.svc"
    namespace: production
  syncPolicy:
    automated:
      prune: true    # delete resources that no longer exist in Git
      selfHeal: true # revert manual changes made in the cluster
    syncOptions:
      - CreateNamespace=true
      - ApplyOutOfSyncOnly=true
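The effect of prune and selfHeal can be made concrete with a simplified reconciliation model. This is a sketch under stated assumptions, not ArgoCD's implementation: manifests are reduced to name/spec strings, lastSynced stands for what the previous sync applied, and the function name reconcile and all sample resource names are hypothetical.

```typescript
type Manifests = Record<string, string>; // resource name -> rendered spec

interface SyncPolicy {
  prune: boolean;
  selfHeal: boolean;
}

// One automated sync pass: returns what the cluster should look like afterwards.
function reconcile(
  git: Manifests,
  lastSynced: Manifests,
  live: Manifests,
  policy: SyncPolicy,
): Manifests {
  const next: Manifests = { ...live };

  for (const [name, spec] of Object.entries(git)) {
    const gitChanged = lastSynced[name] !== spec; // new commit changed this resource
    const drifted = live[name] !== spec;          // cluster differs from Git
    // A Git change is always applied; pure cluster drift only with selfHeal.
    if (gitChanged || (drifted && policy.selfHeal)) next[name] = spec;
  }

  if (policy.prune) {
    // Remove resources this application once managed that are gone from Git.
    for (const name of Object.keys(lastSynced)) {
      if (!(name in git)) delete next[name];
    }
  }
  return next;
}

const git: Manifests = { deployment: "replicas=3" };
const lastSynced: Manifests = { deployment: "replicas=3", "legacy-job": "v1" };
const live: Manifests = { deployment: "replicas=5", "legacy-job": "v1", "other-team-app": "v9" };

const strict = reconcile(git, lastSynced, live, { prune: true, selfHeal: true });
const lenient = reconcile(git, lastSynced, live, { prune: false, selfHeal: false });
console.log(strict, lenient);
```

With both options on, the manual scale-up is reverted and the removed job is deleted, while resources the application never managed are left alone. With both off, the same cluster state is simply reported as out of sync and left unchanged.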

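Flux CD and infrastructure pipelines

Flux is another GitOps tool and a CNCF graduated project. Compared with ArgoCD, Flux is lighter and follows the Kubernetes controller pattern more closely. Flux manages Helm releases and OCI artifacts natively, and it can drive Terraform through the separately maintained Terraform controller (tf-controller) add-on.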

# Terraform resource for the Flux Terraform controller
apiVersion: infra.contrib.fluxcd.io/v1alpha2
kind: Terraform
metadata:
  name: network-infra
spec:
  interval: 1h
  path: ./terraform/network
  sourceRef:
    kind: GitRepository
    name: infra-repo
  approvePlan: auto                # apply plans without manual approval
  destroyResourcesOnDeletion: true # deleting this object destroys the infrastructure
  vars:
    - name: environment
      value: production
  backendConfig:
    disable: false
    customConfiguration:
      secretRef:
        name: tf-backend-config

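Plan approval and automatic apply strategy

Recommended branching strategy: manage IaC changes with Git Flow or trunk-based development. Every PR generates a Terraform plan that is posted as a PR comment. After the merge to main, apply runs automatically. Applies to production should sit behind a manual approval gate, and changes to critical resources (databases, network configuration) need an additional approval.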
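One way to decide when the additional approval applies is to inspect the JSON form of the plan (terraform show -json). The sketch below reads the resource_changes array of that format and lists changes that touch critical resource types. The CRITICAL_TYPES list, the function name, and the sample plan are assumptions made for illustration; each team would choose its own list.

```typescript
// Subset of the `terraform show -json` plan format used by this sketch.
interface ResourceChange {
  address: string;
  type: string;
  change: { actions: string[] };
}

interface PlanJson {
  resource_changes?: ResourceChange[];
}

// Assumption: resource types this team treats as critical.
const CRITICAL_TYPES = ["aws_db_instance", "aws_vpc", "aws_subnet", "aws_security_group"];

// Returns a description of every real change to a critical resource.
function changesNeedingExtraApproval(plan: PlanJson): string[] {
  return (plan.resource_changes ?? [])
    .filter((rc) => CRITICAL_TYPES.includes(rc.type))
    .filter((rc) => rc.change.actions.some((action) => action !== "no-op" && action !== "read"))
    .map((rc) => `${rc.address}: ${rc.change.actions.join("+")}`);
}

const samplePlan: PlanJson = {
  resource_changes: [
    { address: "aws_db_instance.postgres", type: "aws_db_instance", change: { actions: ["delete", "create"] } },
    { address: "aws_vpc.main", type: "aws_vpc", change: { actions: ["no-op"] } },
    { address: "aws_lambda_function.api_handler", type: "aws_lambda_function", change: { actions: ["update"] } },
  ],
};

const flagged = changesNeedingExtraApproval(samplePlan);
console.log(flagged); // [ 'aws_db_instance.postgres: delete+create' ]
```

A pipeline step could fail, or request a second reviewer, whenever the returned list is not empty.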

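Drift detection

Infrastructure drift is a difference between the actual cloud configuration and the state defined in IaC code. Drift can come from manual operations, auto-scaling events, direct API calls, or third-party integrations. Regular drift detection is key to keeping an IaC workflow reliable.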

#!/bin/bash
# drift_detection.sh - scheduled infrastructure drift check

ENVIRONMENTS=("dev" "staging" "production")
ROOT_DIR=$(pwd)

for ENV in "${ENVIRONMENTS[@]}"; do
  echo "=== Checking $ENV environment ==="

  # Always start from the repository root so every iteration finds its directory
  cd "$ROOT_DIR/terraform/environments/$ENV" || exit 1

  terraform init -input=false -backend-config="backend-${ENV}.hcl"

  # plan refreshes state itself, so a separate `terraform refresh` is not needed.
  # -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes present
  PLAN_OUTPUT=$(terraform plan -no-color -input=false -detailed-exitcode 2>&1)
  EXIT_CODE=$?

  if [ "$EXIT_CODE" -eq 2 ]; then
    echo "WARNING: Drift detected in $ENV!"
    # Send an alert (for example to Slack or PagerDuty)
    curl -X POST -H "Content-Type: application/json" \
      -d "{\"text\": \"IaC drift alert: infrastructure drift detected in $ENV\"}" \
      "$SLACK_WEBHOOK_URL"
  elif [ "$EXIT_CODE" -eq 1 ]; then
    echo "ERROR: Terraform plan failed in $ENV"
    echo "$PLAN_OUTPUT"
    exit 1
  else
    echo "OK: No drift detected in $ENV"
  fi
done

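Drift handling best practices: (1) run drift detection on a schedule (every 6 hours is a reasonable default); (2) fix drift by updating the IaC code rather than by manual changes; (3) in an emergency a manual fix is acceptable, but the code must be updated to match afterwards; (4) use `terraform state rm` and `terraform import` when a resource has been recreated.

6. IaC Testing and Security Compliance

Infrastructure code needs testing too. IaC testing has four layers: static code analysis, security scanning, cost estimation, and integration verification. Embedding automated IaC tests in the CI/CD pipeline catches problems before the infrastructure is deployed and keeps misconfigurations out of production.

Static security scanning: tfsec and Checkov

tfsec and Checkov are among the most widely used static security analyzers for Terraform code. They look for security misconfigurations in HCL, such as open security group rules, unencrypted storage volumes, and plaintext secrets.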

# Example tfsec scan (output abridged and illustrative)
$ tfsec .

# Findings:
- [AWS002] Resource 'aws_security_group_rule.web_sg' should not have an ingress rule
  that allows all traffic (0.0.0.0/0) on port 22 (SSH).
  Severity: CRITICAL
  Guide: https://tfsec.dev/docs/aws/vpc/no-public-ingress-sgr
- [AWS039] Resource 'aws_s3_bucket.assets' does not have server-side encryption enabled.
  Severity: HIGH
  Guide: https://tfsec.dev/docs/aws/s3/enable-bucket-encryption

# Checkov scan with a JUnit XML report
checkov --directory . --framework terraform --output junitxml > checkov-report.xml

# Checkov in CI/CD ("CKV_AWS_xx" stands for the ID of a check to skip)
checkov --directory terraform/ \
  --framework terraform \
  --skip-check "CKV_AWS_xx" \
  --quiet \
  --compact \
  --download-external-modules false

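Cost estimation: Infracost

Infracost estimates infrastructure cost. During code review it can already show the change in monthly cost that a Terraform change would cause, which helps the team keep cloud spending under control.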

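# Baseline cost breakdown
infracost breakdown --path terraform/ \
  --terraform-var-file prod.tfvars \
  --format json --out-file infracost-base.json

# Cost change of the PR branch compared with that baseline
infracost diff --path terraform/ \
  --terraform-var-file prod.tfvars \
  --compare-to infracost-base.json \
  --format json --out-file infracost.json

# Render the diff as the body of a PR comment
infracost output --path infracost.json --format github-comment --out-file comment.md

# Example breakdown output (figures are illustrative, not real prices):
NAME                        MONTHLY QTY  UNIT   MONTHLY COST
aws_instance.app_server             730  hours        $84.32
aws_db_instance.postgres            730  hours       $297.68
aws_lb.web                          730  hours        $22.05
aws_s3_bucket.assets                  1  GB            $2.45
--------------------------------------------------------
TOTAL                                                $406.50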
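What a cost diff boils down to is a per-resource subtraction between a baseline and the proposed change. The sketch below shows that arithmetic using the illustrative figures from the sample output, expressed in cents to keep the sums exact. The function costDiff is a hypothetical helper and does not use the Infracost API or its JSON schema.

```typescript
type MonthlyCents = Record<string, number>; // resource address -> monthly cost in cents

interface CostDiff {
  perResource: Record<string, number>; // only resources whose cost changed
  totalCents: number;
}

// Compare proposed monthly costs against a baseline.
function costDiff(baseline: MonthlyCents, proposed: MonthlyCents): CostDiff {
  const names = Array.from(new Set(Object.keys(baseline).concat(Object.keys(proposed))));
  const perResource: Record<string, number> = {};
  let totalCents = 0;

  for (const name of names) {
    const delta = (proposed[name] ?? 0) - (baseline[name] ?? 0);
    if (delta !== 0) perResource[name] = delta;
    totalCents += delta;
  }
  return { perResource, totalCents };
}

const diff = costDiff(
  { "aws_instance.app_server": 8432, "aws_lb.web": 2205 },
  { "aws_instance.app_server": 8432, "aws_lb.web": 2205, "aws_db_instance.postgres": 29768 },
);
console.log(`Monthly cost change: +$${(diff.totalCents / 100).toFixed(2)}`);
```

In this example the only change is the new database, so the whole monthly increase is attributed to aws_db_instance.postgres.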

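Terraform plan verification and sandbox testing

Before a merge to main, every PR should generate a Terraform plan and show a summary of the changes as a PR comment. A sandbox (ephemeral) environment gives each PR its own temporary infrastructure for end-to-end integration tests.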

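#!/bin/bash
# sandbox_setup.sh - create temporary infrastructure for one PR
set -euo pipefail

PR_ID=$1
REPO_ROOT=$(pwd)
SANDBOX_DIR="terraform/environments/sandbox"

# Variables for this PR's sandbox
cat > "$SANDBOX_DIR/sandbox.tfvars" << EOF
environment       = "sandbox"
pr_id             = "$PR_ID"
instance_type     = "t3.micro"
enable_monitoring = false
resource_prefix   = "sbox-${PR_ID}"
EOF

cd "$SANDBOX_DIR"

# Give each PR its own state key so concurrent sandboxes do not share state
terraform init -input=false \
  -backend-config="backend-sandbox.hcl" \
  -backend-config="key=sandbox/pr-${PR_ID}/terraform.tfstate"
terraform plan -var-file=sandbox.tfvars -out=sandbox.tfplan
terraform apply -auto-approve sandbox.tfplan

# Run the integration tests from the repository root
# (--env is assumed to be a custom option defined by the test suite)
cd "$REPO_ROOT"
pytest tests/integration/ --env sandbox --junitxml=test-results.xml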

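Compliance checks

Enterprise infrastructure has to meet compliance requirements such as SOC 2, HIPAA, and PCI DSS. Policy-as-code tools such as Sentinel (HashiCorp), OPA (Open Policy Agent), or Kyverno let you embed compliance rules in the IaC pipeline.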

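# OPA policy example: forbid publicly accessible databases.
# Written against the JSON plan format (terraform show -json), which lists
# planned changes under resource_changes. `import rego.v1` needs OPA 0.59+.
package terraform.deny

import rego.v1

# Deny RDS instances that are publicly accessible
deny contains msg if {
    rc := input.resource_changes[_]
    rc.type == "aws_db_instance"
    rc.change.after.publicly_accessible == true
    msg := sprintf("RDS instance %v must not be publicly accessible", [rc.address])
}

# Deny security groups that open SSH to every source address
deny contains msg if {
    rc := input.resource_changes[_]
    rc.type == "aws_security_group"
    rule := rc.change.after.ingress[_]
    rule.from_port <= 22
    rule.to_port >= 22
    rule.cidr_blocks[_] == "0.0.0.0/0"
    msg := sprintf("Security group %v must not open the SSH port to all sources", [rc.address])
}

# Run the OPA check against a saved plan:
# terraform show -json tfplan > plan.json
# opa eval --data policy/terraform.rego --input plan.json "data.terraform.deny.deny"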

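A complete IaC test pipeline

# GitHub Actions IaC test pipeline
# Cloud credentials for init/plan are omitted and must be configured separately.
name: "IaC Test Suite"

on:
  pull_request:
    paths: ["terraform/**"]

jobs:
  static_analysis:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: "Setup Terraform"
        uses: hashicorp/setup-terraform@v2
        with:
          terraform_wrapper: false # keep `terraform show -json` output clean
      - name: "tfsec: Security Scan"
        uses: aquasecurity/tfsec-action@v1
      - name: "Checkov: Compliance Scan"
        uses: bridgecrewio/checkov-action@v12
      - name: "Infracost: Setup"
        uses: infracost/actions/setup@v2
        with:
          api-key: ${{ secrets.INFRACOST_API_KEY }}
      - name: "Infracost: Cost Estimation"
        run: infracost breakdown --path terraform/
      - name: "Terraform Fmt Check"
        run: terraform fmt -check -recursive
        working-directory: terraform
      - name: "Terraform Init"
        run: terraform init -input=false
        working-directory: terraform
      - name: "Terraform Validate"
        run: terraform validate -no-color
        working-directory: terraform
      - name: "Terraform Plan"
        run: terraform plan -input=false -out=tfplan
        working-directory: terraform
      - name: "Setup OPA"
        uses: open-policy-agent/setup-opa@v2
      - name: "OPA Policy Check"
        # --fail-defined makes the step fail when any deny message is produced
        run: |
          terraform -chdir=terraform show -json tfplan > plan.json
          opa eval --fail-defined --data policy/ --input plan.json \
            "data.terraform.deny.deny[_]"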

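IaC testing maturity model: L1 syntax checks (terraform validate) → L2 security scanning (tfsec/Checkov) → L3 cost estimation (Infracost) → L4 policy compliance (OPA/Sentinel) → L5 integration testing (sandbox environments plus automated tests) → L6 chaos engineering for infrastructure. Most teams should aim for L3 to L4, and critical infrastructure should reach L5.

7. Workflow Best Practices

Golden rules of the IaC workflow:

  1. Codify everything: every infrastructure change goes through code; manual changes are not allowed
  2. Single source of truth: the code in the Git repository is the only authoritative description of the infrastructure
  3. Plan first: run plan before every change and review its blast radius
  4. Least privilege: CI/CD roles get only the permissions the workflow needs
  5. Immutable infrastructure: avoid modifying servers in place; prefer rebuild-and-replace
  6. Shift security left: run security scans and compliance checks at commit time, not after deployment
  7. State is an asset: the state file is a core asset, so back it up and version it
  8. Continuous verification: regular drift detection plus automated remediation keeps code and the real environment consistent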

# Makefile: single entry point for IaC operations
.PHONY: init plan plan-approve apply destroy fmt lint test security cost

init:
	terraform init -backend-config=backend-$(env).hcl

plan: init
	terraform plan -var-file=$(env).tfvars -out=tfplan -no-color

apply: plan
	terraform apply tfplan

# Plan, then ask for confirmation before applying
plan-approve: plan
	@printf "Apply this plan to $(env)? [y/N] " && read ans && [ "$$ans" = "y" ]
	terraform apply tfplan

fmt:
	terraform fmt -recursive

lint:
	tflint --format compact

test:
	checkov --directory . --framework terraform --compact
	tfsec . --format sarif

security: lint test

cost:
	infracost breakdown --path . --terraform-var-file=$(env).tfvars

destroy: init
	terraform plan -var-file=$(env).tfvars -destroy -out=destroy.tfplan
	terraform apply destroy.tfplan

# Usage examples
# make env=dev plan
# make env=production plan-approve
# make env=staging security

"Infrastructure as Code is a cultural question, not a tooling question. It asks teams to align Ops practice with the Dev process and to treat infrastructure as a first-class citizen of the software delivery lifecycle."