Topic: Claude Code workflow systems study
Keywords: Claude Code, IaC, Terraform, Pulumi, Ansible, GitOps, ArgoCD, Terraform State, drift detection
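1. Choosing and Comparing IaC Frameworks
Infrastructure as Code (IaC) is a core pillar of DevOps practice: it brings the configuration and management of infrastructure in line with the software development workflow. The choice of IaC framework shapes both team productivity and the maintainability of the infrastructure. The mainstream options today are declarative provisioning tools (Terraform, OpenTofu, CloudFormation, Pulumi), configuration management tools (Ansible), and image builders (Packer).
Capability comparison of mainstream IaC frameworks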
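Framework | Type | Language | State management | Multi-cloud | Typical use
Terraform | Declarative | HCL | Yes (state file) | Yes | Infrastructure resource provisioning
OpenTofu | Declarative | HCL | Yes (state file) | Yes | Open-source alternative to Terraform
Pulumi | Declarative/imperative | TS/Python/Go/C# | Yes (state file) | Yes | Managing infrastructure with general-purpose languages
CloudFormation | Declarative | YAML/JSON | Yes (managed by AWS) | No (AWS only) | AWS-specific resource provisioning
CDK | Imperative | TS/Python/Java/C#/Go | Yes (via CloudFormation) | No (AWS only) | Programming AWS infrastructure
Ansible | Declarative (procedural execution) | YAML | None (relies on idempotent tasks) | Yes | Configuration management and application deployment
Packer | Declarative | HCL/JSON | None | Yes | Uniform machine image builds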
Terraform HCL basic example
terraform {
  required_version = "~> 1.6"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.23"
    }
  }
  backend "s3" {
    bucket         = "my-company-terraform-state"
    key            = "infrastructure/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}
Pulumi TypeScript example
import * as aws from "@pulumi/aws";
import * as awsx from "@pulumi/awsx";
// Create the VPC (argument names follow awsx 1.x and later)
const vpc = new awsx.ec2.Vpc("main-vpc", {
  cidrBlock: "10.0.0.0/16",
  numberOfAvailabilityZones: 3,
  // One shared NAT gateway; awsx 1.x expresses this as a strategy rather than a count
  natGateways: { strategy: awsx.ec2.NatGatewayStrategy.Single },
  tags: { Name: "main-vpc", Environment: "production" },
});
// Create the ECS cluster (Fargate workloads run on it)
const cluster = new aws.ecs.Cluster("app-cluster", {
  name: "app-cluster",
  tags: { Environment: "production" },
});
export const vpcId = vpc.vpcId;
export const publicSubnetIds = vpc.publicSubnetIds;
Ansible Playbook example
---
- name: Provision web servers
  hosts: webservers
  become: true
  vars:
    app_port: 8080
    app_version: "1.2.3"
  tasks:
    - name: Install Nginx
      ansible.builtin.apt:
        name: nginx
        state: present
        update_cache: true
    - name: Configure Nginx reverse proxy
      ansible.builtin.template:
        src: nginx.conf.j2
        dest: /etc/nginx/sites-available/app
      notify: restart nginx
    - name: Deploy application container
      community.docker.docker_container:
        name: app
        image: "myregistry/app:{{ app_version }}"
        # ports takes a list of "host:container" mappings
        ports:
          - "{{ app_port }}:8080"
        state: started
        restart_policy: always
  handlers:
    - name: restart nginx
      ansible.builtin.service:
        name: nginx
        state: restarted
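Framework selection advice: for multi-cloud resource provisioning, Terraform or OpenTofu is recommended; when strong type checking is needed, consider Pulumi with TypeScript or Go; for configuration management and application deployment, use Ansible; in AWS-only environments, CDK offers the best developer experience.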
2. Resource Definition and Management
The core capability of IaC is defining cloud resources in code: networking, compute, storage, databases, security policies, and so on. Resource definitions should follow three principles: modular, parameterized, reusable. With a modular design, a team can split infrastructure into logical units that are managed independently, each with clearly defined inputs and outputs.
VPC network topology definition
# Define the VPC, public subnets, and the web security group
data "aws_availability_zones" "available" {
  state = "available"
}
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
  tags = { Name = "main-vpc", Environment = var.environment }
}
resource "aws_subnet" "public" {
  count                   = 3
  vpc_id                  = aws_vpc.main.id
  cidr_block              = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index)
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = true
  tags = { Name = "public-${count.index}", Tier = "public" }
}
resource "aws_security_group" "web_sg" {
  name        = "web-security-group"
  description = "Allow HTTP/HTTPS ingress"
  vpc_id      = aws_vpc.main.id
  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
  tags = { Name = "web-sg" }
}
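The subnet block relies on `cidrsubnet(aws_vpc.main.cidr_block, 8, count.index)` to carve one /24 per subnet out of the /16. As a rough illustration of the arithmetic only (a Python re-implementation written for these notes, IPv4 only, not Terraform's own code), the calculation can be sketched as:

```python
import ipaddress


def cidrsubnet(prefix: str, newbits: int, netnum: int) -> str:
    """Approximate Terraform's cidrsubnet() for IPv4 prefixes."""
    network = ipaddress.IPv4Network(prefix)
    new_prefix = network.prefixlen + newbits
    if new_prefix > 32:
        raise ValueError("not enough address bits left for newbits")
    if not 0 <= netnum < 2 ** newbits:
        raise ValueError("netnum does not fit in newbits")
    # Shift the subnet number into the bits just above the host part.
    host_bits = 32 - new_prefix
    base = int(network.network_address) + (netnum << host_bits)
    return f"{ipaddress.IPv4Address(base)}/{new_prefix}"


# Same inputs as the aws_subnet.public block with count = 3.
public_subnets = [cidrsubnet("10.0.0.0/16", 8, i) for i in range(3)]
print(public_subnets)
```

With `newbits = 8` the three public subnets land on 10.0.0.0/24, 10.0.1.0/24 and 10.0.2.0/24, so later subnet groups can continue from index 3 without overlapping.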
EC2 and RDS resource definitions
# Create an EC2 instance and an RDS database
resource "aws_instance" "app_server" {
  ami                    = data.aws_ami.amazon_linux_2.id
  instance_type          = var.instance_type
  subnet_id              = aws_subnet.public[0].id
  vpc_security_group_ids = [aws_security_group.web_sg.id]
  iam_instance_profile   = aws_iam_instance_profile.app.name
  user_data = <<-EOF
    #!/bin/bash
    yum update -y
    yum install -y docker
    systemctl enable --now docker
  EOF
  root_block_device {
    volume_type = "gp3"
    volume_size = 100
    encrypted   = true
  }
  tags = { Name = "app-server" }
}
resource "aws_db_instance" "postgres" {
  identifier     = "app-database"
  engine         = "postgres"
  engine_version = "15.3"
  instance_class = "db.r6g.large"
  db_name        = "appdb"
  # "admin" is a reserved word for the PostgreSQL engine on RDS, so use another master username
  username                = "appadmin"
  password                = random_password.db_password.result
  vpc_security_group_ids  = [aws_security_group.database_sg.id]
  db_subnet_group_name    = aws_db_subnet_group.private.name
  storage_encrypted       = true
  backup_retention_period = 30
  backup_window           = "03:00-04:00"
  maintenance_window      = "sun:05:00-sun:06:00"
  deletion_protection     = true
  skip_final_snapshot     = false
  tags = { Name = "app-postgres" }
}
Lambda serverless function
# Create a Lambda function and allow API Gateway to invoke it
resource "aws_lambda_function" "api_handler" {
  filename      = "handler.zip"
  function_name = "api-handler"
  role          = aws_iam_role.lambda_exec.arn
  handler       = "index.handler"
  runtime       = "nodejs18.x"
  memory_size   = 256
  timeout       = 30
  publish       = true
  environment {
    variables = {
      DB_HOST   = aws_db_instance.postgres.address
      DB_PORT   = aws_db_instance.postgres.port
      DB_NAME   = aws_db_instance.postgres.db_name
      S3_BUCKET = aws_s3_bucket.assets.id
    }
  }
  tracing_config {
    mode = "Active"
  }
  tags = { Service = "api" }
}
resource "aws_lambda_permission" "api_gw" {
  statement_id  = "AllowAPIGatewayInvoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.api_handler.function_name
  principal     = "apigateway.amazonaws.com"
  source_arn    = "${aws_api_gateway_rest_api.api.execution_arn}/*/*"
}
S3 bucket and load balancer
# Create the S3 bucket (front-end static assets)
resource "aws_s3_bucket" "assets" {
  bucket = "mycompany-assets-${var.environment}"
  tags   = { Name = "assets" }
}
resource "aws_s3_bucket_versioning" "assets" {
  bucket = aws_s3_bucket.assets.id
  versioning_configuration {
    status = "Enabled"
  }
}
resource "aws_s3_bucket_public_access_block" "assets" {
  bucket                  = aws_s3_bucket.assets.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
# Create the application load balancer
resource "aws_lb" "web" {
  name                       = "web-alb"
  internal                   = false
  load_balancer_type         = "application"
  security_groups            = [aws_security_group.alb_sg.id]
  subnets                    = aws_subnet.public[*].id
  enable_deletion_protection = true
  tags = { Name = "web-alb" }
}
resource "aws_lb_target_group" "web" {
  name     = "web-tg"
  port     = 80
  protocol = "HTTP"
  vpc_id   = aws_vpc.main.id
  health_check {
    path                = "/health"
    interval            = 30
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 3
  }
}
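Resource dependencies and passing values between resources
Infrastructure resources naturally depend on one another, and IaC tools work out the execution order from implicit references plus explicit depends_on. Terraform analyzes every reference between resources and builds the dependency graph automatically; an explicit declaration is only needed when a dependency exists but no attribute reference expresses it (for example, a peering connection that should wait for subnets it never references). Getting dependencies right is key to avoiding failed deployments.
# Implicit dependency: created automatically by the reference
resource "aws_eip" "nat" {
  domain   = "vpc"
  instance = aws_instance.nat.id # implicit dependency on aws_instance.nat
}
# Explicit dependency: for cases Terraform cannot infer
resource "aws_vpc_peering_connection" "peer" {
  vpc_id      = aws_vpc.main.id
  peer_vpc_id = aws_vpc.other.id
  auto_accept = true
  # Both VPCs are already implicit dependencies through the references above;
  # only the subnets need to be listed explicitly.
  depends_on = [aws_subnet.public]
}
# Data source: read information about an existing resource
data "aws_ami" "amazon_linux_2" {
  most_recent = true
  owners      = ["amazon"]
  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }
}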
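To make the idea of a dependency graph concrete, the ordering problem can be sketched with Python's standard `graphlib`. This is only an illustration of topological ordering, not Terraform's graph builder; the edges below are hand-written assumptions based on the snippets in this section (for example, that `aws_instance.nat` sits in a public subnet).

```python
from graphlib import TopologicalSorter

# Each key depends on the resources in its set (assumed edges, for illustration).
dependencies = {
    "aws_vpc.main": set(),
    "aws_vpc.other": set(),
    "aws_subnet.public": {"aws_vpc.main"},
    "aws_instance.nat": {"aws_subnet.public"},
    "aws_eip.nat": {"aws_instance.nat"},
    "aws_vpc_peering_connection.peer": {
        "aws_vpc.main",       # implicit, via vpc_id
        "aws_vpc.other",      # implicit, via peer_vpc_id
        "aws_subnet.public",  # explicit, via depends_on
    },
}


def creation_order(graph: dict) -> list:
    """Return one valid creation order: dependencies always come first."""
    return list(TopologicalSorter(graph).static_order())


order = creation_order(dependencies)
print(order)
```

Any order in which every resource appears after the resources it depends on is valid; resources with no path between them (the two VPCs here) can be created in parallel. Destruction walks the same graph in reverse.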
Kubernetes cluster and workloads
# Create the EKS cluster
resource "aws_eks_cluster" "main" {
  name     = "main-eks"
  role_arn = aws_iam_role.eks_cluster.arn
  version  = "1.27"
  vpc_config {
    subnet_ids              = aws_subnet.private[*].id
    endpoint_private_access = true
    endpoint_public_access  = false
  }
  enabled_cluster_log_types = ["api", "audit", "authenticator"]
}
resource "aws_eks_node_group" "general" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "general"
  node_role_arn   = aws_iam_role.eks_nodes.arn
  subnet_ids      = aws_subnet.private[*].id
  instance_types  = ["r6i.large"]
  scaling_config {
    desired_size = 3
    min_size     = 1
    max_size     = 10
  }
  update_config {
    max_unavailable = 1
  }
  tags = { Cluster = "main-eks" }
}
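3. Configuration Management
Configuration management is the part of an IaC workflow that handles differences between applications and environments. From environment variables to secrets to dynamic runtime configuration, a good system should support per-environment configuration, version history for configuration, encrypted storage of sensitive values, and live changes to runtime configuration.
Environment variables and configuration files
Infrastructure and application configuration usually differ across environments (development, testing, staging, production). Terraform separates environments with variable files (.tfvars) and environment variables (TF_VAR_name), which avoids hard-coding.
# Development environment: dev.tfvars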
environment = "dev"
instance_type = "t3.medium"
db_instance = "db.t4g.small"
min_capacity = 1
max_capacity = 5
enable_https = false
log_retention = 7
# Production environment: prod.tfvars
environment = "prod"
instance_type = "r6i.xlarge"
db_instance = "db.r6g.large"
min_capacity = 3
max_capacity = 20
enable_https = true
log_retention = 365
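Environment isolation best practices: keep a separate .tfvars file per environment; inject secrets through environment variables in the CI/CD pipeline; exclude any file that contains secrets via .gitignore and distribute it through a secure channel.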
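Secrets management and AWS Parameter Store
# Store sensitive values in AWS Secrets Manager
resource "aws_secretsmanager_secret" "db_credentials" {
  name                    = "db-credentials-${var.environment}"
  recovery_window_in_days = 7
}
# Rotation is configured on a separate resource; the inline rotation_rules
# block on aws_secretsmanager_secret was removed in AWS provider v5.
# aws_lambda_function.rotate_db is assumed to be defined elsewhere.
resource "aws_secretsmanager_secret_rotation" "db_credentials" {
  secret_id           = aws_secretsmanager_secret.db_credentials.id
  rotation_lambda_arn = aws_lambda_function.rotate_db.arn
  rotation_rules {
    automatically_after_days = 90
  }
}
resource "aws_secretsmanager_secret_version" "db_credentials" {
  secret_id = aws_secretsmanager_secret.db_credentials.id
  secret_string = jsonencode({
    username = aws_db_instance.postgres.username
    password = random_password.db_password.result
    host     = aws_db_instance.postgres.address
    port     = aws_db_instance.postgres.port
  })
}
# Store configuration in AWS SSM Parameter Store
resource "aws_ssm_parameter" "app_config" {
  name = "/app/${var.environment}/config"
  type = "SecureString"
  value = jsonencode({
    feature_flags     = { new_checkout = true, dark_mode = false }
    api_rate_limit    = 1000
    cache_ttl_seconds = 300
    external_services = {
      payment_gateway_url = "https://pay.example.com"
      webhook_url         = "https://hooks.example.com/events"
    }
  })
}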
AppConfig dynamic configuration
Tools such as AWS AppConfig and LaunchDarkly deliver configuration dynamically at runtime, so feature flags, rate-limit thresholds, and similar parameters can be adjusted without redeploying. This matters most for emergency changes and canary releases.
# AWS AppConfig dynamic configuration resources
resource "aws_appconfig_application" "app" {
  name = "my-app"
}
resource "aws_appconfig_configuration_profile" "feature_flags" {
  application_id = aws_appconfig_application.app.id
  name           = "feature-flags"
  location_uri   = "hosted"
  type           = "AWS.AppConfig.FeatureFlags"
  validator {
    type = "JSON_SCHEMA"
    content = jsonencode({
      "$schema" = "http://json-schema.org/draft-07/schema#"
      type      = "object"
    })
  }
}
resource "aws_appconfig_environment" "production" {
  application_id = aws_appconfig_application.app.id
  name           = "production"
  monitors {
    alarm_arn      = aws_cloudwatch_metric_alarm.app_error.arn
    alarm_role_arn = aws_iam_role.appconfig.arn
  }
}
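Configuration versioning and auditing
All configuration files should live in a Git repository; combined with Terraform plan output, this forms a complete audit trail of changes. Every configuration change should produce a plan report in the CI/CD pipeline, and that report should be archived. Managing configuration baselines with Git tags and version numbers makes it possible to roll back quickly to a known-good configuration during disaster recovery.
Core principles of configuration management: (1) keep code and configuration separate, and never hard-code sensitive values; (2) express environment differences through injected variables, not through branches or copied code; (3) configuration changes need review and approval; (4) prefer runtime configuration over deploy-time configuration, but control tightly who may change it.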
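4. State Management
Terraform state is one of the most important concepts in IaC. The state file records metadata and dependencies for the infrastructure that was actually deployed, and it is what Terraform uses to work out the difference between the real world and the desired configuration. Without sound state management, teams run into resource drift, concurrent-write conflicts, and data loss.
Terraform state in detail
Terraform state maps each resource declared in configuration to the ID of a real resource at the cloud provider. Every resource block gets a unique resource address, and the state stores the resource's attribute values, dependencies, and metadata (such as the provider configuration). The state is stored as JSON; it can be inspected directly, but editing it by hand is not recommended.
# List the resources in the current state
terraform state list
# Example output:
data.aws_ami.amazon_linux_2
aws_vpc.main
aws_subnet.public[0]
aws_subnet.public[1]
aws_subnet.public[2]
aws_security_group.web_sg
aws_instance.app_server
aws_db_instance.postgres
aws_lb.web
# Show the recorded state of one resource
terraform state show aws_instance.app_server
# Remove a resource from state (the real resource is not destroyed)
terraform state rm aws_instance.app_server
# Import an existing resource into state
terraform import aws_instance.app_server i-0abcd1234efgh5678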
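To see how resource addresses relate to the JSON inside a state file, here is a small read-only sketch in Python. It assumes the version 4 state layout (`resources` entries with `mode`, `type`, `name`, an optional `module`, and `instances` carrying an optional `index_key`); that layout is an internal format that may change, and the sample document below is invented for illustration. For real work, `terraform state list` or `terraform show -json` is the supported route.

```python
def resource_addresses(state: dict) -> list:
    """Rebuild resource addresses from a (version 4 style) state document."""
    addresses = []
    for res in state.get("resources", []):
        base = f'{res["type"]}.{res["name"]}'
        if res.get("mode") == "data":
            base = "data." + base
        if "module" in res:
            base = f'{res["module"]}.{base}'
        for inst in res.get("instances", []):
            key = inst.get("index_key")
            if key is None:
                addresses.append(base)            # single instance
            elif isinstance(key, int):
                addresses.append(f"{base}[{key}]")    # count
            else:
                addresses.append(f'{base}["{key}"]')  # for_each
    return addresses


# Invented sample, trimmed to the fields the function reads.
SAMPLE_STATE = {
    "version": 4,
    "resources": [
        {"mode": "data", "type": "aws_ami", "name": "amazon_linux_2",
         "instances": [{"attributes": {}}]},
        {"mode": "managed", "type": "aws_vpc", "name": "main",
         "instances": [{"attributes": {"id": "vpc-123"}}]},
        {"mode": "managed", "type": "aws_subnet", "name": "public",
         "instances": [{"index_key": 0}, {"index_key": 1}]},
    ],
}

print(resource_addresses(SAMPLE_STATE))
```

The function never writes to the state, in keeping with the advice above not to modify state by hand.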
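Remote state storage and state locking
Production environments must keep state in a remote backend to avoid the risk of losing a local file. Common backends include AWS S3 with DynamoDB, Azure Storage, GCS, and Terraform Cloud. A remote backend combined with a locking mechanism (such as DynamoDB) prevents concurrent operations from corrupting state.
# S3 backend + DynamoDB state locking
terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "network/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
# DynamoDB lock table (must exist before terraform init; create it manually or from a separate bootstrap configuration)
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"
  attribute {
    name = "LockID"
    type = "S"
  }
  server_side_encryption {
    enabled = true
  }
}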
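State migration
State has to be migrated when a team refactors its configuration or switches backends. Terraform provides `terraform state mv` to rename or restructure resource addresses in state, and `terraform state pull` / `terraform state push` to copy state between a local file and the backend. To move from local state to a remote backend, change the backend configuration and run `terraform init -migrate-state`. Note that `-reconfigure` does the opposite (it ignores the existing state rather than migrating it), and the two flags cannot be combined.
# State migration: S3 → S3 (different bucket)
# 1. Change the backend configuration to point at the new bucket
# 2. Run the migration
terraform init -migrate-state
# State migration: renaming resources (module refactoring)
# Old resource address: module.legacy.aws_vpc.main
# New resource address: module.network.aws_vpc.main
terraform state mv module.legacy.aws_vpc.main module.network.aws_vpc.main
# Move a whole module (preview first with -dry-run)
terraform state mv -dry-run 'module.legacy' 'module.network'
terraform state mv 'module.legacy' 'module.network'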
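State recovery and disaster response
Important: a lost state file can be rebuilt by importing resources one at a time with terraform import, but the process is extremely tedious and may lose the dependency information recorded in state. Enable versioning on the S3 bucket so that every state write is kept as a recoverable version, and add a lifecycle rule that expires old noncurrent versions.
# Enable S3 bucket versioning to protect the state file
resource "aws_s3_bucket_versioning" "state_bucket" {
  bucket = aws_s3_bucket.terraform_state.id
  versioning_configuration {
    status = "Enabled"
  }
}
# Restore state from an S3 object version
# 1. List all versions of the state file
aws s3api list-object-versions --bucket company-terraform-state --prefix network/terraform.tfstate
# 2. Download the chosen version ("VersionId" is a placeholder for a real version ID from step 1)
aws s3api get-object --bucket company-terraform-state --key network/terraform.tfstate \
  --version-id "VersionId" terraform.tfstate.recovered
# 3. Push it to the current backend (Terraform may refuse a state with an older serial; review carefully before resorting to -force)
terraform state push terraform.tfstate.recovered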
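Choosing which version to download in step 2 is usually a matter of finding the newest version written before the incident. A minimal Python sketch, assuming the JSON printed by `aws s3api list-object-versions` has been loaded into a dict with a `Versions` list whose entries carry `Key`, `VersionId` and `LastModified`; the sample listing and version IDs are invented.

```python
from datetime import datetime, timezone
from typing import Optional


def _parse(ts: str) -> datetime:
    # Accept both "...Z" and "...+00:00" timestamp styles.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def pick_recovery_version(listing: dict, key: str, before: datetime) -> Optional[str]:
    """Return the VersionId of the newest version of `key` written before `before`."""
    candidates = [
        v for v in listing.get("Versions", [])
        if v["Key"] == key and _parse(v["LastModified"]) < before
    ]
    if not candidates:
        return None
    newest = max(candidates, key=lambda v: _parse(v["LastModified"]))
    return newest["VersionId"]


# Invented listing for illustration.
SAMPLE_LISTING = {
    "Versions": [
        {"Key": "network/terraform.tfstate", "VersionId": "v3",
         "LastModified": "2024-05-03T09:00:00+00:00", "IsLatest": True},
        {"Key": "network/terraform.tfstate", "VersionId": "v2",
         "LastModified": "2024-05-01T10:00:00+00:00", "IsLatest": False},
        {"Key": "network/terraform.tfstate", "VersionId": "v1",
         "LastModified": "2024-04-20T08:30:00Z", "IsLatest": False},
    ]
}

incident = datetime(2024, 5, 2, tzinfo=timezone.utc)
print(pick_recovery_version(SAMPLE_LISTING, "network/terraform.tfstate", incident))
```

The returned ID would then be passed to `--version-id` in the download command. The cutoff must be a timezone-aware datetime so it can be compared with the parsed timestamps.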
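Importing existing resources
Bringing manually created resources under IaC management is a common migration scenario. The `terraform import` command maps an existing resource to a resource block in the configuration. After importing, run `terraform plan` to confirm that the configuration matches the real resource.
# Define the matching resource block in configuration
resource "aws_s3_bucket" "existing_data" {
  bucket = "company-existing-data-bucket"
}
# Import command
terraform import aws_s3_bucket.existing_data company-existing-data-bucket
# After importing, run plan to see attribute differences,
# then adjust the configuration until the plan is clean
terraform plan
# Declarative import with import blocks (Terraform 1.5+), convenient for importing many resources at once
import {
  to = aws_s3_bucket.existing_data
  id = "company-existing-data-bucket"
}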
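5. CI/CD for Infrastructure
The CI/CD pipeline for infrastructure is the heart of IaC workflow automation. Unlike a traditional application pipeline, an infrastructure pipeline has to deal with the two-phase plan/apply model, approval flows, state file security, and drift detection. GitOps, a more advanced model for infrastructure deployment, treats the Git repository as the single source of truth and relies on an automated operator to manage infrastructure declaratively.
GitOps workflow architecture
The core idea of GitOps: the Git repository holds the desired state of the infrastructure, and an automated operator continuously checks that the actual environment matches it and reconciles any difference. This removes manual-operation mistakes and provides a complete audit trail of changes.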
# GitOps workflow example: GitHub Actions + Terraform
name: "Terraform Infrastructure Pipeline"
on:
  push:
    branches: [main]
    paths:
      - "terraform/**"
  pull_request:
    branches: [main]
    paths:
      - "terraform/**"
env:
  TF_VERSION: "1.6.0"
  TF_WORKING_DIR: "terraform"
jobs:
  terraform_plan:
    name: "Terraform Plan"
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: "Setup Terraform"
        uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: ${{ env.TF_VERSION }}
      - name: "Terraform Init"
        run: terraform init
        working-directory: ${{ env.TF_WORKING_DIR }}
      - name: "Terraform Validate"
        run: terraform validate -no-color
        working-directory: ${{ env.TF_WORKING_DIR }}
      - name: "Terraform Plan"
        id: plan
        run: terraform plan -no-color -out=tfplan
        working-directory: ${{ env.TF_WORKING_DIR }}
      # Jobs run on separate runners, so the plan file must be handed over as an artifact
      - name: "Upload Plan"
        uses: actions/upload-artifact@v4
        with:
          name: tfplan
          path: ${{ env.TF_WORKING_DIR }}/tfplan
      - name: "Post Plan Comment"
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v6
        env:
          # Pass the plan text through the environment instead of interpolating it into the script
          PLAN: ${{ steps.plan.outputs.stdout }}
        with:
          script: |
            const output = "## Terraform Plan\n```\n" + process.env.PLAN + "\n```";
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: output
            })
  terraform_apply:
    name: "Terraform Apply"
    needs: [terraform_plan]
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
      - name: "Setup Terraform"
        uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: ${{ env.TF_VERSION }}
      - name: "Terraform Init"
        run: terraform init
        working-directory: ${{ env.TF_WORKING_DIR }}
      - name: "Download Plan"
        uses: actions/download-artifact@v4
        with:
          name: tfplan
          path: ${{ env.TF_WORKING_DIR }}
      - name: "Terraform Apply"
        run: terraform apply -auto-approve tfplan
        working-directory: ${{ env.TF_WORKING_DIR }}
ArgoCD and Kubernetes GitOps
ArgoCD is one of the most widely used GitOps tools in the Kubernetes ecosystem. It continuously watches application definitions in a Git repository (Helm charts, Kustomize, or plain YAML) and syncs them to the target cluster. When the configuration in Git changes, ArgoCD brings the cluster to the desired state automatically.
# ArgoCD Application definition
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production-infra
  namespace: argocd
spec:
  project: default
  source:
    repoURL: "https://github.com/company/gitops-infra.git"
    targetRevision: main
    path: kubernetes/overlays/production
    helm:
      valueFiles:
        - values-prod.yaml
  destination:
    server: "https://kubernetes.default.svc"
    namespace: production
  syncPolicy:
    automated:
      prune: true     # delete resources that no longer exist in Git
      selfHeal: true  # revert manual changes made in the cluster
    syncOptions:
      - CreateNamespace=true
      - ApplyOutOfSyncOnly=true
Flux CD and infrastructure pipelines
Flux is another CNCF GitOps project (graduated, like Argo). Compared with ArgoCD, Flux is lighter and is built more directly on the Kubernetes controller pattern. Flux itself manages Helm releases and OCI artifacts; Terraform resources are handled through the separate Terraform controller (tf-controller) add-on.
# Terraform resource for the Flux Terraform controller
apiVersion: infra.contrib.fluxcd.io/v1alpha2
kind: Terraform
metadata:
  name: network-infra
spec:
  interval: 1h
  path: ./terraform/network
  sourceRef:
    kind: GitRepository
    name: infra-repo
  approvePlan: auto
  destroyResourcesOnDeletion: true
  vars:
    - name: environment
      value: production
  backendConfig:
    disable: false
    customConfiguration:
      secretRef:
        name: tf-backend-config
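Plan approval and automatic apply policy
Recommended branching strategy: manage IaC changes with Git Flow or trunk-based development. Every PR generates a Terraform plan that is shown as a PR comment. Merging to main triggers apply automatically. Production applies should sit behind a manual approval gate, and changes to critical resources (databases, network configuration) need additional approval.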
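One way to decide automatically whether a change touches critical resources is to inspect the machine-readable plan. The sketch below is only an illustration: it assumes the plan was exported with `terraform show -json` and reads its `resource_changes` list, and the set of "sensitive" resource types is a hypothetical policy choice, not something prescribed by Terraform.

```python
# Hypothetical policy: resource types whose changes need an extra approver.
SENSITIVE_TYPES = {
    "aws_db_instance",
    "aws_vpc",
    "aws_subnet",
    "aws_security_group",
}


def needs_extra_approval(plan: dict) -> list:
    """Return addresses of sensitive resources that the plan would change."""
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = set(rc.get("change", {}).get("actions", []))
        # "no-op" and "read" do not modify real infrastructure.
        changes_something = bool(actions - {"no-op", "read"})
        if rc.get("type") in SENSITIVE_TYPES and changes_something:
            flagged.append(rc["address"])
    return flagged


# Invented plan fragment for illustration.
SAMPLE_PLAN = {
    "resource_changes": [
        {"address": "aws_db_instance.postgres", "type": "aws_db_instance",
         "change": {"actions": ["update"]}},
        {"address": "aws_vpc.main", "type": "aws_vpc",
         "change": {"actions": ["no-op"]}},
        {"address": "aws_lambda_function.api_handler", "type": "aws_lambda_function",
         "change": {"actions": ["delete", "create"]}},
    ]
}

print(needs_extra_approval(SAMPLE_PLAN))
```

A pipeline could require a second reviewer or a protected environment whenever the returned list is non-empty, and let other changes through the normal gate.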
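Drift detection
Infrastructure drift is any difference between the actual cloud configuration and the state defined in IaC code. Drift can come from manual operations, auto-scaling events, direct API calls, or third-party integrations. Regular drift detection is key to keeping an IaC workflow reliable.
#!/bin/bash
# drift_detection.sh - Terraform drift check, suitable for scheduled runs
ENVIRONMENTS=("dev" "staging" "production")
for ENV in "${ENVIRONMENTS[@]}"; do
  echo "=== Checking $ENV environment ==="
  # Run each environment in a subshell so the cd does not leak into the next iteration
  (
    cd "terraform/environments/$ENV" || exit 1
    terraform init -input=false -backend-config="backend-${ENV}.hcl" || exit 1
    # plan refreshes state itself; -detailed-exitcode returns
    # 0 = no changes, 1 = error, 2 = changes (drift) present
    terraform plan -no-color -input=false -detailed-exitcode > plan.log 2>&1
  )
  EXIT_CODE=$?
  if [ $EXIT_CODE -eq 2 ]; then
    echo "WARNING: Drift detected in $ENV!"
    # Send an alert (e.g. Slack, PagerDuty)
    curl -X POST -H "Content-Type: application/json" \
      -d "{\"text\": \"IaC drift alert: infrastructure drift detected in $ENV\"}" \
      "$SLACK_WEBHOOK_URL"
  elif [ $EXIT_CODE -eq 1 ]; then
    echo "ERROR: Terraform plan failed in $ENV"
    exit 1
  else
    echo "OK: No drift detected in $ENV"
  fi
done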
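Drift handling best practices: (1) run drift detection on a schedule (every 6 hours is a reasonable starting point); (2) fix drift by updating the IaC code rather than by manual operations; (3) in an emergency a manual fix is acceptable, but the code must be brought back in sync afterwards; (4) use `terraform state rm` and `terraform import` when a resource has been recreated outside Terraform.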
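The drift check hinges on the three exit codes of `terraform plan -detailed-exitcode`: 0 for no changes, 1 for an error, 2 for a successful plan with changes. A small Python sketch of that mapping, useful when the results of several environments are collected by a scheduler; the function names and the sample results are made up for illustration.

```python
from enum import Enum


class PlanResult(Enum):
    NO_CHANGES = "no_changes"
    ERROR = "error"
    DRIFT = "drift"


def classify_plan_exit_code(code: int) -> PlanResult:
    """Map the exit code of `terraform plan -detailed-exitcode` to a result."""
    if code == 0:
        return PlanResult.NO_CHANGES
    if code == 2:
        return PlanResult.DRIFT
    # 1 is the documented error code; treat anything unexpected as an error too.
    return PlanResult.ERROR


def drifted_environments(exit_codes: dict) -> list:
    """Return the environments whose plan reported pending changes."""
    return sorted(
        env for env, code in exit_codes.items()
        if classify_plan_exit_code(code) is PlanResult.DRIFT
    )


# Invented results: environment name -> exit code from the plan run.
results = {"dev": 0, "staging": 2, "production": 2}
print(drifted_environments(results))
```

Treating unknown exit codes as errors keeps a broken run from being mistaken for a clean one.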
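6. IaC Testing, Security, and Compliance
Infrastructure code needs testing too. IaC testing has four layers: static analysis, security scanning, cost estimation, and integration verification. Embedding automated IaC tests in the CI/CD pipeline catches problems before deployment and keeps misconfigurations out of production.
Static security scanning: tfsec and Checkov
tfsec and Checkov are among the most commonly used static security analyzers for Terraform code. They inspect HCL for security misconfigurations such as open security group rules, unencrypted volumes, and plaintext secrets.
# Example tfsec scan result
$ tfsec .
# Issues found:
- [AWS002] Resource 'aws_security_group_rule.web_sg' should not
  have an ingress rule that allows all traffic (0.0.0.0/0)
  on port 22 (SSH).
  Severity: CRITICAL
  Guide: https://tfsec.dev/docs/aws/vpc/no-public-ingress-sgr
- [AWS039] Resource 'aws_s3_bucket.assets' does not have
  server-side encryption enabled.
  Severity: HIGH
  Guide: https://tfsec.dev/docs/aws/s3/enable-bucket-encryption
# Checkov scan
checkov --directory . --framework terraform --output junitxml > checkov-report.xml
# Checkov in CI/CD ("CKV_AWS_xx" is a placeholder for the check IDs to skip)
checkov --directory terraform/ \
  --framework terraform \
  --skip-check "CKV_AWS_xx" \
  --quiet \
  --compact \
  --download-external-modules false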
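Cost estimation: Infracost
Infracost estimates infrastructure cost. During code review it can show the monthly cost change that a Terraform change would cause, which helps the team keep cloud spend under control.
# Infracost cost estimate (run on the base branch to create a baseline)
infracost breakdown --path terraform/ --terraform-var-file prod.tfvars \
  --format json --out-file cost-base.json
# Cost change for the PR, compared with the baseline
infracost diff --path terraform/ --terraform-var-file prod.tfvars \
  --compare-to cost-base.json --format json --out-file cost-diff.json
# Render the diff as a GitHub comment body
infracost output --path cost-diff.json --format github-comment --out-file comment.md
# Example breakdown output:
NAME                        MONTHLY QTY  UNIT   MONTHLY COST
aws_instance.app_server               1  hours        $84.32
aws_db_instance.postgres              1  hours       $297.68
aws_lb.web                            1  hours        $22.05
aws_s3_bucket.assets                  1  GB            $2.45
--------------------------------------------------------
TOTAL                                                $406.50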
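Terraform plan verification and sandbox testing
Before anything is merged to main, every PR should generate a Terraform plan and show a change summary in a PR comment. A sandbox (ephemeral) environment gives each PR its own temporary infrastructure for end-to-end integration tests.
#!/bin/bash
# sandbox_setup.sh - create temporary infrastructure for a PR
set -euo pipefail
PR_ID=$1
SANDBOX_DIR="terraform/environments/sandbox"
cat > "$SANDBOX_DIR/sandbox.tfvars" << EOF
environment = "sandbox"
pr_id = "$PR_ID"
instance_type = "t3.micro"
enable_monitoring = false
resource_prefix = "sbox-${PR_ID}"
EOF
# Create the sandbox; the subshell keeps the cd from affecting the test step below
(
  cd "$SANDBOX_DIR"
  terraform init -backend-config="backend-sandbox.hcl"
  terraform plan -var-file=sandbox.tfvars -out=sandbox.tfplan
  terraform apply -auto-approve sandbox.tfplan
)
# Run integration tests (assumes tests/ lives at the repository root; --env is a project-specific pytest option)
pytest tests/integration/ --env sandbox --junitxml=test-results.xml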
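Compliance check integration
Enterprise infrastructure has to meet compliance requirements such as SOC2, HIPAA, and PCI-DSS. Policy-as-code tools such as Sentinel (HashiCorp), OPA (Open Policy Agent), or Kyverno make it possible to embed compliance rules in the IaC pipeline.
# OPA policy example, evaluated against the output of `terraform show -json`
package terraform
import rego.v1
# Deny RDS instances that are publicly accessible
deny contains msg if {
  rc := input.resource_changes[_]
  rc.type == "aws_db_instance"
  rc.change.after.publicly_accessible == true
  msg := sprintf("RDS instance %v must not be publicly accessible", [rc.address])
}
# Deny security groups whose ingress rules open SSH to every source
deny contains msg if {
  rc := input.resource_changes[_]
  rc.type == "aws_security_group"
  rule := rc.change.after.ingress[_]
  rule.from_port <= 22
  rule.to_port >= 22
  rule.cidr_blocks[_] == "0.0.0.0/0"
  msg := sprintf("Security group %v must not open the SSH port to all sources", [rc.address])
}
# Run the OPA check
# opa eval --data policy/terraform.rego --input plan.json "data.terraform.deny"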
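The two rules can also be prototyped outside OPA, which is handy for unit-testing policy ideas before writing Rego. The sketch below is a plain-Python approximation, not a replacement for OPA: it assumes the input is the JSON produced by `terraform show -json` and reads planned values from `resource_changes[].change.after`; the sample plan is invented.

```python
def deny_messages(plan: dict) -> list:
    """Return one message per violated rule, mirroring the two deny rules."""
    messages = []
    for rc in plan.get("resource_changes", []):
        # "after" is null for resources that are being deleted.
        after = (rc.get("change") or {}).get("after") or {}
        if rc.get("type") == "aws_db_instance":
            if after.get("publicly_accessible") is True:
                messages.append(
                    f"RDS instance {rc['address']} must not be publicly accessible"
                )
        if rc.get("type") == "aws_security_group":
            for rule in after.get("ingress") or []:
                covers_ssh = rule.get("from_port", 0) <= 22 <= rule.get("to_port", 0)
                open_to_all = "0.0.0.0/0" in (rule.get("cidr_blocks") or [])
                if covers_ssh and open_to_all:
                    messages.append(
                        f"Security group {rc['address']} must not open the SSH port to all sources"
                    )
    return messages


# Invented plan fragment for illustration.
SAMPLE_PLAN = {
    "resource_changes": [
        {"address": "aws_db_instance.postgres", "type": "aws_db_instance",
         "change": {"after": {"publicly_accessible": True}}},
        {"address": "aws_security_group.web_sg", "type": "aws_security_group",
         "change": {"after": {"ingress": [
             {"from_port": 443, "to_port": 443, "cidr_blocks": ["0.0.0.0/0"]},
             {"from_port": 0, "to_port": 1024, "cidr_blocks": ["0.0.0.0/0"]},
         ]}}},
        {"address": "aws_db_instance.old", "type": "aws_db_instance",
         "change": {"after": None}},
    ]
}

for message in deny_messages(SAMPLE_PLAN):
    print(message)
```

The port-range comparison matters: a rule that opens ports 0 to 1024 exposes SSH even though 22 never appears literally.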
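Complete IaC test pipeline
# Complete GitHub Actions IaC test pipeline
name: "IaC Test Suite"
on:
  pull_request:
    paths: ["terraform/**"]
jobs:
  static_analysis:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: "Setup Terraform"
        uses: hashicorp/setup-terraform@v2
        with:
          # Disable the wrapper so `terraform show -json` writes clean JSON
          terraform_wrapper: false
      - name: "Setup OPA"
        uses: open-policy-agent/setup-opa@v2
      - name: "tfsec: Security Scan"
        uses: aquasecurity/tfsec-action@v1
      - name: "Checkov: Compliance Scan"
        uses: bridgecrewio/checkov-action@v12
      - name: "Infracost: Cost Estimation"
        uses: infracost/actions/setup@v2
        with:
          api-key: ${{ secrets.INFRACOST_API_KEY }}
      - name: "Terraform Fmt Check"
        run: terraform -chdir=terraform fmt -check -recursive
      - name: "Terraform Init"
        run: terraform -chdir=terraform init -input=false
      - name: "Terraform Validate"
        run: terraform -chdir=terraform validate -no-color
      - name: "Terraform Plan"
        run: terraform -chdir=terraform plan -input=false -out=tfplan
      - name: "OPA Policy Check"
        run: |
          terraform -chdir=terraform show -json tfplan > plan.json
          # --fail-defined makes the step fail when any deny message is produced
          opa eval --fail-defined --data policy/ --input plan.json \
            "data.terraform.deny[msg]"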
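IaC testing maturity model: L1 syntax checks (terraform validate) → L2 security scanning (tfsec/Checkov) → L3 cost estimation (Infracost) → L4 policy compliance (OPA/Sentinel) → L5 integration tests (sandbox environments plus automated tests) → L6 chaos engineering for infrastructure. Most teams aim for L3 to L4; critical infrastructure should reach L5.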
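7. Workflow Best Practices Summary
Golden rules of the IaC workflow:
Codify everything: every infrastructure change goes through code; manual operations are not allowed
Single source of truth: the code in the Git repository is the only authoritative description of the infrastructure
Plan first: run plan before every change and review its blast radius
Least privilege: CI/CD roles get only the permissions the workflow needs
Immutable infrastructure: avoid modifying servers in place; prefer rebuild-and-replace
Shift security left: run security scans and compliance checks at commit time, not after deployment
State is an asset: the state file is a core asset; back it up and keep it versioned
Continuous verification: regular drift detection plus automated remediation keeps code and reality aligned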
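# Makefile: a single entry point for IaC operations
.PHONY: init plan apply destroy fmt lint test security cost plan-approve
init:
	terraform init -backend-config=backend-$(env).hcl
plan: init
	terraform plan -var-file=$(env).tfvars -out=tfplan -no-color
apply: plan
	terraform apply tfplan
# Plan, then ask for confirmation before applying
plan-approve: plan
	@printf "Apply tfplan to $(env)? Type 'yes' to continue: " && read ans && [ "$$ans" = "yes" ]
	terraform apply tfplan
fmt:
	terraform fmt -recursive
lint:
	tflint --format compact
test:
	checkov --directory . --framework terraform --compact
	tfsec . --format sarif
security: lint test
cost:
	infracost breakdown --path . --terraform-var-file=$(env).tfvars
destroy:
	terraform plan -var-file=$(env).tfvars -destroy -out=destroy.tfplan
	terraform apply destroy.tfplan
# Usage examples
# make env=dev plan
# make env=production plan-approve
# make env=staging security
"Infrastructure as code is not a tooling problem but a cultural one. It asks teams to align Ops practice with the Dev workflow and to treat infrastructure as a first-class citizen of the software delivery lifecycle."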