Why infrastructure as code matters for startups
If your infrastructure lives in the AWS console — created by a series of clicks that nobody documented — you have a problem that gets worse every single week. New environments take days to set up. Nobody knows what's running or why. And when something breaks, "just recreate it" turns into a multi-day archaeology project.
Terraform solves this. Your entire infrastructure is defined in code files that live in your repo, go through code review, and can spin up identical environments in minutes. It's one of the highest-ROI investments an early-stage engineering team can make.
This guide walks through the core AWS resources most startups need — VPC, compute, database, and storage — with production-ready Terraform configurations you can use as a starting point.
Getting started
First, install Terraform and configure your AWS credentials:
# Install Terraform (macOS)
brew tap hashicorp/tap
brew install hashicorp/tap/terraform
# Verify installation
terraform --version
# Configure AWS credentials
aws configure
# Or export environment variables
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_DEFAULT_REGION="us-east-1"
Create a project structure that separates concerns:
mkdir -p terraform/{modules,environments}
mkdir -p terraform/environments/{staging,production}
mkdir -p terraform/modules/{vpc,ecs,rds,s3}
Set up your provider configuration. Every Terraform project starts here:
# terraform/environments/production/main.tf
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  backend "s3" {
    bucket         = "your-company-terraform-state"
    key            = "production/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

provider "aws" {
  region = var.aws_region

  default_tags {
    tags = {
      Environment = var.environment
      ManagedBy   = "terraform"
      Project     = var.project_name
    }
  }
}
The backend "s3" block stores your Terraform state remotely so your whole team works from the same source of truth, and the DynamoDB table prevents two people from applying changes at the same time. Note that the state bucket and lock table must already exist before you run terraform init: they can't be managed by the state they hold, so create them once, out of band, before anything else.
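If you don't have the bucket and table yet, a common approach is a one-off bootstrap config, applied once with local state and then left alone. This is a sketch, not part of the configs in this guide; the bucket name is a placeholder (S3 bucket names are globally unique, so pick your own):

```hcl
# One-off bootstrap: run with local state, before any other Terraform.
resource "aws_s3_bucket" "terraform_state" {
  bucket = "your-company-terraform-state" # placeholder; must be globally unique
}

# Versioning lets you recover an earlier state file if one gets corrupted
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

# The S3 backend expects a table with a string hash key named exactly "LockID"
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST" # lock entries are tiny; on-demand is cheapest
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```
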
VPC — your network foundation
Every AWS deployment starts with a VPC. This configuration creates a production-grade network with public and private subnets across multiple availability zones:
# terraform/modules/vpc/main.tf
variable "project_name" {
  type        = string
  description = "Name prefix for all resources"
}

variable "environment" {
  type        = string
  description = "Environment name (staging, production)"
}

variable "vpc_cidr" {
  type        = string
  default     = "10.0.0.0/16"
  description = "CIDR block for the VPC"
}

variable "availability_zones" {
  type        = list(string)
  default     = ["us-east-1a", "us-east-1b"]
  description = "Availability zones to use"
}

resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name = "${var.project_name}-${var.environment}-vpc"
  }
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id

  tags = {
    Name = "${var.project_name}-${var.environment}-igw"
  }
}

resource "aws_subnet" "public" {
  count                   = length(var.availability_zones)
  vpc_id                  = aws_vpc.main.id
  cidr_block              = cidrsubnet(var.vpc_cidr, 8, count.index)
  availability_zone       = var.availability_zones[count.index]
  map_public_ip_on_launch = true

  tags = {
    Name = "${var.project_name}-${var.environment}-public-${count.index}"
    Tier = "public"
  }
}

resource "aws_subnet" "private" {
  count             = length(var.availability_zones)
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(var.vpc_cidr, 8, count.index + 10)
  availability_zone = var.availability_zones[count.index]

  tags = {
    Name = "${var.project_name}-${var.environment}-private-${count.index}"
    Tier = "private"
  }
}

resource "aws_eip" "nat" {
  count  = length(var.availability_zones)
  domain = "vpc"

  tags = {
    Name = "${var.project_name}-${var.environment}-nat-eip-${count.index}"
  }
}

resource "aws_nat_gateway" "main" {
  count         = length(var.availability_zones)
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id

  tags = {
    Name = "${var.project_name}-${var.environment}-nat-${count.index}"
  }

  depends_on = [aws_internet_gateway.main]
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }

  tags = {
    Name = "${var.project_name}-${var.environment}-public-rt"
  }
}

resource "aws_route_table" "private" {
  count  = length(var.availability_zones)
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main[count.index].id
  }

  tags = {
    Name = "${var.project_name}-${var.environment}-private-rt-${count.index}"
  }
}

resource "aws_route_table_association" "public" {
  count          = length(var.availability_zones)
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "private" {
  count          = length(var.availability_zones)
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private[count.index].id
}

output "vpc_id" {
  value = aws_vpc.main.id
}

output "public_subnet_ids" {
  value = aws_subnet.public[*].id
}

output "private_subnet_ids" {
  value = aws_subnet.private[*].id
}
This gives you a solid network foundation: public subnets for load balancers, private subnets for your application and database, NAT gateways so private resources can reach the internet, and everything spread across multiple availability zones for resilience. For staging, you can use a single NAT gateway to save costs — in production, keep one per AZ.
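One way to wire in that staging cost-saving is a single_nat_gateway flag. The flag and its plumbing are not part of the module as written; this is a sketch of how the NAT-related resources above would change (these blocks replace their counterparts, not sit alongside them):

```hcl
variable "single_nat_gateway" {
  type        = bool
  default     = false
  description = "Share one NAT gateway across all AZs (cheaper, less resilient)"
}

# One EIP and one gateway when the flag is set, one per AZ otherwise
resource "aws_eip" "nat" {
  count  = var.single_nat_gateway ? 1 : length(var.availability_zones)
  domain = "vpc"
}

resource "aws_nat_gateway" "main" {
  count         = var.single_nat_gateway ? 1 : length(var.availability_zones)
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id
  depends_on    = [aws_internet_gateway.main]
}

# Every private route table points at gateway 0 when there is only one
resource "aws_route_table" "private" {
  count  = length(var.availability_zones)
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main[var.single_nat_gateway ? 0 : count.index].id
  }
}
```

The trade-off: if the AZ hosting the single gateway goes down, private subnets in the other AZs lose outbound internet access, which is why production keeps one per AZ.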
ECS Fargate — serverless container hosting
Fargate is our go-to for startups because you get the flexibility of containers without managing EC2 instances. You pay only for the resources your containers actually use, and scaling is straightforward.
# terraform/modules/ecs/main.tf
variable "project_name" { type = string }
variable "environment" { type = string }
variable "vpc_id" { type = string }
variable "public_subnet_ids" { type = list(string) }
variable "private_subnet_ids" { type = list(string) }
variable "container_image" { type = string }

variable "container_port" {
  type    = number
  default = 3000
}

variable "cpu" {
  type    = number
  default = 256
}

variable "memory" {
  type    = number
  default = 512
}

variable "desired_count" {
  type    = number
  default = 2
}

resource "aws_ecs_cluster" "main" {
  name = "${var.project_name}-${var.environment}"

  setting {
    name  = "containerInsights"
    value = "enabled"
  }
}

resource "aws_cloudwatch_log_group" "app" {
  name              = "/ecs/${var.project_name}-${var.environment}"
  retention_in_days = 30
}

resource "aws_iam_role" "ecs_task_execution" {
  name = "${var.project_name}-${var.environment}-ecs-execution"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "ecs-tasks.amazonaws.com"
        }
      }
    ]
  })
}

resource "aws_iam_role_policy_attachment" "ecs_task_execution" {
  role       = aws_iam_role.ecs_task_execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

resource "aws_security_group" "alb" {
  name_prefix = "${var.project_name}-${var.environment}-alb-"
  vpc_id      = var.vpc_id

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "ecs_tasks" {
  name_prefix = "${var.project_name}-${var.environment}-ecs-"
  vpc_id      = var.vpc_id

  ingress {
    from_port       = var.container_port
    to_port         = var.container_port
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_lb" "main" {
  name               = "${var.project_name}-${var.environment}-alb"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]
  subnets            = var.public_subnet_ids
}

resource "aws_lb_target_group" "app" {
  name        = "${var.project_name}-${var.environment}-tg"
  port        = var.container_port
  protocol    = "HTTP"
  vpc_id      = var.vpc_id
  target_type = "ip"

  health_check {
    path                = "/health"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 5
    interval            = 30
    matcher             = "200"
  }
}

resource "aws_lb_listener" "http" {
  load_balancer_arn = aws_lb.main.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.app.arn
  }
}

resource "aws_ecs_task_definition" "app" {
  family                   = "${var.project_name}-${var.environment}"
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = var.cpu
  memory                   = var.memory
  execution_role_arn       = aws_iam_role.ecs_task_execution.arn

  container_definitions = jsonencode([
    {
      name  = "app"
      image = var.container_image
      portMappings = [
        {
          containerPort = var.container_port
          hostPort      = var.container_port
          protocol      = "tcp"
        }
      ]
      logConfiguration = {
        logDriver = "awslogs"
        options = {
          "awslogs-group"         = aws_cloudwatch_log_group.app.name
          # Hardcoded to match the examples; pass a region variable if you deploy elsewhere
          "awslogs-region"        = "us-east-1"
          "awslogs-stream-prefix" = "ecs"
        }
      }
    }
  ])
}

resource "aws_ecs_service" "app" {
  name            = "${var.project_name}-${var.environment}-app"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = var.desired_count
  launch_type     = "FARGATE"

  network_configuration {
    subnets          = var.private_subnet_ids
    security_groups  = [aws_security_group.ecs_tasks.id]
    assign_public_ip = false
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.app.arn
    container_name   = "app"
    container_port   = var.container_port
  }

  depends_on = [aws_lb_listener.http]
}

resource "aws_appautoscaling_target" "ecs" {
  max_capacity       = 10
  min_capacity       = var.desired_count
  resource_id        = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.app.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

resource "aws_appautoscaling_policy" "cpu" {
  name               = "${var.project_name}-${var.environment}-cpu-scaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.ecs.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value = 70.0
  }
}

output "alb_dns_name" {
  value = aws_lb.main.dns_name
}

output "ecs_security_group_id" {
  value = aws_security_group.ecs_tasks.id
}
This sets up an ECS Fargate service behind an Application Load Balancer with auto-scaling. Your containers run in private subnets with no direct internet exposure, health checks ensure traffic only routes to healthy tasks, and the auto-scaling policy adds capacity when average CPU utilization exceeds 70%. Start with 256 CPU units (0.25 vCPU) and 512 MB of memory for a typical Node.js API and scale up as needed.
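One thing to note: the listener above serves plain HTTP. Before going to production you'll want TLS terminated at the ALB with an ACM certificate. A sketch of the usual pattern (the certificate_arn variable is an assumption, not part of the module as written, and the redirect listener replaces the aws_lb_listener.http resource rather than coexisting with it, since only one listener can own port 80):

```hcl
variable "certificate_arn" {
  type        = string
  description = "ARN of an ACM certificate in the same region as the ALB"
}

resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.main.arn
  port              = 443
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS13-1-2-2021-06"
  certificate_arn   = var.certificate_arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.app.arn
  }
}

# Replaces the plain-HTTP forward listener: port 80 now just redirects to 443
resource "aws_lb_listener" "http_redirect" {
  load_balancer_arn = aws_lb.main.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type = "redirect"
    redirect {
      port        = "443"
      protocol    = "HTTPS"
      status_code = "HTTP_301"
    }
  }
}
```
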
RDS PostgreSQL — your managed database
Every startup needs a reliable database, and managed PostgreSQL gives you automated backups, failover, and patching without a dedicated DBA.
# terraform/modules/rds/main.tf
variable "project_name" { type = string }
variable "environment" { type = string }
variable "vpc_id" { type = string }
variable "private_subnet_ids" { type = list(string) }
variable "ecs_security_group_id" { type = string }

variable "instance_class" {
  type    = string
  default = "db.t4g.micro"
}

variable "allocated_storage" {
  type    = number
  default = 20
}

variable "db_name" {
  type    = string
  default = "app"
}

resource "aws_db_subnet_group" "main" {
  name       = "${var.project_name}-${var.environment}"
  subnet_ids = var.private_subnet_ids

  tags = {
    Name = "${var.project_name}-${var.environment}-db-subnet-group"
  }
}

resource "aws_security_group" "rds" {
  name_prefix = "${var.project_name}-${var.environment}-rds-"
  vpc_id      = var.vpc_id

  ingress {
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [var.ecs_security_group_id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "random_password" "db_password" {
  length  = 32
  special = false
}

resource "aws_secretsmanager_secret" "db_password" {
  name = "${var.project_name}-${var.environment}-db-password"
}

resource "aws_secretsmanager_secret_version" "db_password" {
  secret_id     = aws_secretsmanager_secret.db_password.id
  secret_string = random_password.db_password.result
}

resource "aws_db_instance" "main" {
  identifier            = "${var.project_name}-${var.environment}"
  engine                = "postgres"
  engine_version        = "16.3"
  instance_class        = var.instance_class
  allocated_storage     = var.allocated_storage
  max_allocated_storage = var.allocated_storage * 5

  db_name  = var.db_name
  username = "app_admin"
  password = random_password.db_password.result

  db_subnet_group_name   = aws_db_subnet_group.main.name
  vpc_security_group_ids = [aws_security_group.rds.id]

  multi_az            = var.environment == "production" ? true : false
  publicly_accessible = false

  backup_retention_period = 7
  backup_window           = "03:00-04:00"
  maintenance_window      = "Mon:04:00-Mon:05:00"

  deletion_protection       = var.environment == "production" ? true : false
  skip_final_snapshot       = var.environment == "production" ? false : true
  final_snapshot_identifier = var.environment == "production" ? "${var.project_name}-${var.environment}-final" : null

  # Performance Insights isn't supported on the smallest burstable classes
  # (db.t4g.micro / db.t4g.small), so only enable it on larger instances
  performance_insights_enabled = contains(["db.t4g.micro", "db.t4g.small"], var.instance_class) ? false : true

  # Enhanced Monitoring (monitoring_interval > 0) requires a monitoring_role_arn;
  # left at 0 here — add an IAM role for it when you want OS-level metrics
  monitoring_interval = 0

  tags = {
    Name = "${var.project_name}-${var.environment}-postgres"
  }
}

output "db_endpoint" {
  value = aws_db_instance.main.endpoint
}

output "db_secret_arn" {
  value = aws_secretsmanager_secret.db_password.arn
}
Key decisions in this configuration: the database password is generated automatically and stored in AWS Secrets Manager rather than in code or environment variables (it does still land in your Terraform state, which is one more reason to keep the state bucket encrypted and access-controlled). Multi-AZ failover is enabled for production but not staging — it doubles the cost and staging doesn't need five-nines availability. Storage auto-scaling is configured so you don't get paged at 3am because the disk filled up. And deletion_protection prevents accidental destruction of your production database.
Start with db.t4g.micro — it's enough for most startups through Series A. Move to db.r6g instances when you need more memory for complex queries.
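To get the generated password into your application without ever writing it down, ECS can inject it from Secrets Manager at container start. A sketch of what that looks like in the ECS module (the db_secret_arn variable and the policy name are assumptions; you'd pass module.rds.db_secret_arn through as the value):

```hcl
variable "db_secret_arn" { type = string }

# The execution role fetches the secret when it launches the task
resource "aws_iam_role_policy" "read_db_secret" {
  name = "read-db-secret"
  role = aws_iam_role.ecs_task_execution.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = "secretsmanager:GetSecretValue"
        Resource = var.db_secret_arn
      }
    ]
  })
}

# Then, inside container_definitions, alongside portMappings and logConfiguration:
#   secrets = [
#     {
#       name      = "DATABASE_PASSWORD" # env var your app reads at startup
#       valueFrom = var.db_secret_arn
#     }
#   ]
```

With this in place the plaintext password never appears in the task definition or your shell history; ECS resolves it at launch time.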
S3 — storage with lifecycle policies
Object storage for uploads, backups, logs, and static assets. The lifecycle policies are critical — without them, storage costs grow linearly forever.
# terraform/modules/s3/main.tf
variable "project_name" { type = string }
variable "environment" { type = string }

resource "aws_s3_bucket" "uploads" {
  # S3 bucket names are globally unique; prefix with your company name if this collides
  bucket = "${var.project_name}-${var.environment}-uploads"

  tags = {
    Name = "${var.project_name}-${var.environment}-uploads"
  }
}

resource "aws_s3_bucket_versioning" "uploads" {
  bucket = aws_s3_bucket.uploads.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "uploads" {
  bucket = aws_s3_bucket.uploads.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
    bucket_key_enabled = true
  }
}

resource "aws_s3_bucket_public_access_block" "uploads" {
  bucket = aws_s3_bucket.uploads.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_s3_bucket_lifecycle_configuration" "uploads" {
  bucket = aws_s3_bucket.uploads.id

  rule {
    id     = "transition-to-ia"
    status = "Enabled"
    filter {} # apply to all objects; the v5 provider expects an explicit filter or prefix

    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }

    transition {
      days          = 90
      storage_class = "GLACIER_IR"
    }
  }

  rule {
    id     = "cleanup-old-versions"
    status = "Enabled"
    filter {}

    noncurrent_version_transition {
      noncurrent_days = 30
      storage_class   = "STANDARD_IA"
    }

    noncurrent_version_expiration {
      noncurrent_days = 90
    }
  }

  rule {
    id     = "abort-incomplete-uploads"
    status = "Enabled"
    filter {}

    abort_incomplete_multipart_upload {
      days_after_initiation = 7
    }
  }
}

resource "aws_s3_bucket" "logs" {
  bucket = "${var.project_name}-${var.environment}-logs"
}

resource "aws_s3_bucket_lifecycle_configuration" "logs" {
  bucket = aws_s3_bucket.logs.id

  rule {
    id     = "expire-old-logs"
    status = "Enabled"
    filter {}

    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }

    expiration {
      days = 365
    }
  }
}

output "uploads_bucket_name" {
  value = aws_s3_bucket.uploads.id
}

output "uploads_bucket_arn" {
  value = aws_s3_bucket.uploads.arn
}

output "logs_bucket_name" {
  value = aws_s3_bucket.logs.id
}
The lifecycle policies are the most important part here. Files accessed frequently stay in S3 Standard. After 30 days, they move to Infrequent Access (roughly 45% cheaper than Standard). After 90 days, they move to Glacier Instant Retrieval (roughly 80% cheaper than Standard, at the cost of higher retrieval fees). Old object versions are cleaned up after 90 days, and incomplete multipart uploads are aborted after a week. Without these rules, we routinely see startups spending 3-4x what they need to on S3.
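If your access patterns are unpredictable, one alternative to the fixed 30/90-day schedule is S3 Intelligent-Tiering, which moves objects between access tiers automatically in exchange for a small per-object monitoring fee. A sketch, assuming it replaces the transition rules above rather than sitting alongside them:

```hcl
resource "aws_s3_bucket_lifecycle_configuration" "uploads" {
  bucket = aws_s3_bucket.uploads.id

  rule {
    id     = "intelligent-tiering"
    status = "Enabled"
    filter {}

    transition {
      days          = 0 # move objects immediately; S3 manages the tiers from there
      storage_class = "INTELLIGENT_TIERING"
    }
  }
}
```

The fixed schedule is cheaper when you know files go cold on a predictable timeline; Intelligent-Tiering wins when you don't.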
Deploying it all
Wire the modules together and deploy:
# terraform/environments/production/main.tf (add to existing)
variable "aws_region" {
  type    = string
  default = "us-east-1"
}

variable "environment" {
  type    = string
  default = "production"
}

variable "project_name" {
  type    = string
  default = "myapp"
}

module "vpc" {
  source       = "../../modules/vpc"
  project_name = var.project_name
  environment  = var.environment
}

module "ecs" {
  source             = "../../modules/ecs"
  project_name       = var.project_name
  environment        = var.environment
  vpc_id             = module.vpc.vpc_id
  public_subnet_ids  = module.vpc.public_subnet_ids
  private_subnet_ids = module.vpc.private_subnet_ids
  # Placeholder image URI — in practice, have CI push immutable tags rather than :latest
  container_image    = "your-account.dkr.ecr.us-east-1.amazonaws.com/myapp:latest"
  desired_count      = 2
}

module "rds" {
  source                = "../../modules/rds"
  project_name          = var.project_name
  environment           = var.environment
  vpc_id                = module.vpc.vpc_id
  private_subnet_ids    = module.vpc.private_subnet_ids
  ecs_security_group_id = module.ecs.ecs_security_group_id
  instance_class        = "db.t4g.small"
}

module "s3" {
  source       = "../../modules/s3"
  project_name = var.project_name
  environment  = var.environment
}
# Initialize and deploy
cd terraform/environments/production
terraform init
terraform plan -out=tfplan
terraform apply tfplan
Always run terraform plan first and review the output. Never apply without reviewing. In a team setting, run Terraform through CI/CD (we'll cover that in a future post) so that every infrastructure change goes through code review.
The bottom line
Infrastructure as code with Terraform isn't just a best practice — it's a force multiplier. Once these modules exist, spinning up a new environment takes minutes instead of days. New team members can read the code to understand your infrastructure. And every change is tracked, reviewed, and reversible.
Start with these four modules — VPC, compute, database, and storage — and expand from there. Your future self (and your future ops hire) will thank you.