Cloud · AWS · Cost Optimization · Infrastructure

Cloud Cost Optimization in 2026: Cut Your AWS Bill Without Cutting Capabilities

Kavora Systems · March 17, 2026 · 9 min read

The cloud bill problem

Here's a pattern we see constantly: a company migrates to the cloud, things work great, the team moves fast -- and then the CFO starts asking questions. "Why is our AWS bill $30K this month? It was $12K six months ago. What changed?"

Usually, nobody knows. Cloud costs creep up gradually. An oversized instance here, a forgotten test environment there, a data transfer pattern nobody thought about. By the time someone notices, you're overspending by 25-40% -- and untangling the waste from the essential spend feels overwhelming.

It doesn't have to be. The biggest savings come from a handful of well-understood patterns, and most of them can be implemented in a few weeks without touching your application code.

The top 5 cost drivers (and how to fix them)

1. Oversized instances

The problem: This is the single biggest source of cloud waste. Teams provision instances based on worst-case estimates or "it was running slow so we upgraded," then never right-size. We routinely find production instances running at 5-15% average CPU utilization -- meaning you're paying for roughly 7-20x the compute you actually use.

The fix: Use AWS Compute Optimizer. It analyzes 14 days of utilization data and recommends the right instance size. For most workloads, this alone cuts compute costs by 20-35%.

bash
# Enable Compute Optimizer (do this once)
aws compute-optimizer update-enrollment-status \
  --status Active \
  --include-member-accounts

# Get recommendations for EC2 instances
aws compute-optimizer get-ec2-instance-recommendations \
  --filters name=Finding,values=OVER_PROVISIONED \
  --output table

Graviton instances are the other major lever here. AWS's ARM-based Graviton3 processors deliver up to 25% better performance at roughly 20% lower cost than equivalent x86 instances. For most workloads (Node.js, Python, Java, Go, containerized applications), switching from m6i to m7g is a near drop-in change -- you need an ARM64 AMI and ARM builds of any native dependencies -- that immediately cuts your compute bill.

hcl
# Terraform: switch to Graviton
resource "aws_instance" "app" {
  # Before: x86 instance
  # instance_type = "m6i.xlarge"    # $0.192/hr

  # After: Graviton instance -- 20% cheaper, 25% faster
  instance_type = "m7g.xlarge"      # $0.1632/hr

  # Make sure your AMI supports ARM64
  ami = data.aws_ami.amazon_linux_arm64.id
}

2. Idle and forgotten resources

The problem: Test environments that nobody shut down. Load balancers with no targets. EBS volumes detached from any instance. Elastic IPs sitting unattached (AWS now charges for every public IPv4 address, so an idle EIP is pure waste). Old snapshots accumulating daily.

The fix: Run a monthly cleanup sweep. These AWS CLI commands identify the most common offenders:

bash
# Find unattached EBS volumes (you're paying for these)
aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[*].{ID:VolumeId,Size:Size,Created:CreateTime}' \
  --output table

# Find unused Elastic IPs ($3.65/month each when unattached)
aws ec2 describe-addresses \
  --query 'Addresses[?AssociationId==null].{IP:PublicIp,AllocationId:AllocationId}' \
  --output table

# Find idle load balancers: target groups with zero healthy targets
for tg in $(aws elbv2 describe-target-groups \
    --query 'TargetGroups[*].TargetGroupArn' --output text); do
  healthy=$(aws elbv2 describe-target-health --target-group-arn "$tg" \
    --query 'length(TargetHealthDescriptions[?TargetHealth.State==`healthy`])' \
    --output text)
  [ "$healthy" -eq 0 ] && echo "No healthy targets: $tg"
done

# Find snapshots older than 90 days (date -d is GNU date syntax)
aws ec2 describe-snapshots --owner-ids self \
  --query "Snapshots[?StartTime<=\`$(date -d '90 days ago' +%Y-%m-%d)\`].{ID:SnapshotId,Size:VolumeSize,Date:StartTime}" \
  --output table

In our experience, the first cleanup sweep typically recovers 5-10% of total cloud spend. Set a monthly calendar reminder -- resources drift back to idle faster than you'd think.

3. Unoptimized storage

The problem: All S3 data stored in Standard tier regardless of access patterns. No lifecycle policies, so storage costs grow linearly forever. EBS volumes using gp2 instead of the newer (and cheaper) gp3. RDS storage over-provisioned and never reclaimed.

The fix: Enable S3 Intelligent-Tiering for any bucket where you can't predict access patterns. It automatically moves objects between access tiers based on usage -- there's no retrieval fee, just a small per-object monitoring charge.

hcl
# Terraform: S3 Intelligent-Tiering
resource "aws_s3_bucket_intelligent_tiering_configuration" "default" {
  bucket = aws_s3_bucket.data.id
  name   = "EntireBucket"

  tiering {
    access_tier = "ARCHIVE_ACCESS"
    days        = 90
  }

  tiering {
    access_tier = "DEEP_ARCHIVE_ACCESS"
    days        = 180
  }
}

# Lifecycle policy for known cold data
resource "aws_s3_bucket_lifecycle_configuration" "logs" {
  bucket = aws_s3_bucket.logs.id

  rule {
    id     = "archive-old-logs"
    status = "Enabled"

    # Apply to every object (newer AWS provider versions expect an explicit filter)
    filter {}

    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }

    transition {
      days          = 90
      storage_class = "GLACIER_IR"
    }

    expiration {
      days = 365
    }
  }
}

For EBS, upgrade all gp2 volumes to gp3. It's 20% cheaper with better baseline performance, and you can do it live without downtime:

bash
# Modify a gp2 volume to gp3 (no downtime)
aws ec2 modify-volume \
  --volume-id vol-0123456789abcdef0 \
  --volume-type gp3
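
If you have more than a handful of volumes, loop over them -- a minimal sketch that converts every remaining gp2 volume in the current region; review the list before running it against production:

bash
# Find all gp2 volumes and convert each one in place
for vol in $(aws ec2 describe-volumes \
    --filters Name=volume-type,Values=gp2 \
    --query 'Volumes[*].VolumeId' --output text); do
  echo "Converting $vol to gp3"
  aws ec2 modify-volume --volume-id "$vol" --volume-type gp3
done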

4. Missing reserved capacity

The problem: You're paying on-demand pricing for workloads that run 24/7. On-demand is the most expensive pricing tier AWS offers -- it's designed for unpredictable, short-term workloads, not for your production database that's been running continuously for two years.

The fix: AWS Savings Plans are the modern replacement for Reserved Instances. They're simpler, more flexible, and cover EC2, Fargate, and Lambda.

The strategy:

  1. Analyze 30-60 days of usage in AWS Cost Explorer to understand your baseline (Cost Explorer can generate a purchase recommendation for you -- see the sketch after this list)
  2. Commit to Compute Savings Plans for your steady-state baseline (typically 40-60% of your total compute). These offer up to 66% savings and apply automatically across instance types and regions.
  3. Use EC2 Instance Savings Plans for workloads you're confident won't change instance family. These offer up to 72% savings but are less flexible.
  4. Keep the remainder on-demand for variable and experimental workloads.
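
A minimal sketch of step 1 using the Cost Explorer CLI -- the term, payment option, and lookback period here are assumptions; tune them to your own risk tolerance:

bash
# Ask AWS what Compute Savings Plan commitment fits your last 60 days of usage
aws ce get-savings-plans-purchase-recommendation \
  --savings-plans-type COMPUTE_SP \
  --term-in-years ONE_YEAR \
  --payment-option NO_UPFRONT \
  --lookback-period-in-days SIXTY_DAYS

Once the baseline is committed, an Auto Scaling group can keep the covered capacity on-demand and run everything above it on Spot: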

hcl
# Terraform: auto-scaling policy to optimize reserved + on-demand mix
resource "aws_autoscaling_group" "app" {
  name                = "app-production"
  min_size            = 2    # Covered by Savings Plans
  max_size            = 10   # Burst capacity on-demand
  desired_capacity    = 3

  mixed_instances_policy {
    instances_distribution {
      # Use on-demand for baseline (covered by Savings Plans)
      on_demand_base_capacity                  = 2
      on_demand_percentage_above_base_capacity = 0
      # Use Spot for burst capacity (up to 90% cheaper)
      spot_allocation_strategy = "capacity-optimized"
    }

    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.app.id
        version            = "$Latest"
      }

      # Graviton instance options for cost optimization
      override {
        instance_type = "m7g.large"
      }
      override {
        instance_type = "m6g.large"
      }
      override {
        instance_type = "c7g.large"
      }
    }
  }
}

5. Data transfer costs

The problem: Data transfer is the line item that surprises everyone. Inter-AZ traffic, NAT gateway charges, API Gateway data processing, CloudFront to origin requests -- they add up quietly and can represent 10-15% of your total bill.

The fix:

  • Use VPC endpoints for S3 and DynamoDB access from within your VPC. This eliminates NAT gateway charges for those services (NAT gateway processing is $0.045/GB).
  • Keep traffic within the same AZ where possible. Inter-AZ transfer costs $0.01/GB in each direction -- this adds up fast for chatty microservices.
  • Use CloudFront for API responses, not just static assets. Caching even 30% of your API responses at the edge significantly reduces origin transfer costs.
  • Review NAT gateway usage. NAT gateways process all outbound traffic from private subnets. If your services are making heavy outbound calls, consider placing them in public subnets with security groups instead. (A sketch for measuring NAT gateway traffic follows the endpoint config below.)

hcl
# Terraform: VPC endpoints to eliminate NAT gateway charges
resource "aws_vpc_endpoint" "s3" {
  vpc_id       = aws_vpc.main.id
  service_name = "com.amazonaws.ca-central-1.s3"
  vpc_endpoint_type = "Gateway"

  route_table_ids = aws_route_table.private[*].id

  tags = {
    Name = "s3-vpc-endpoint"
  }
}

resource "aws_vpc_endpoint" "dynamodb" {
  vpc_id       = aws_vpc.main.id
  service_name = "com.amazonaws.ca-central-1.dynamodb"
  vpc_endpoint_type = "Gateway"

  route_table_ids = aws_route_table.private[*].id

  tags = {
    Name = "dynamodb-vpc-endpoint"
  }
}
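
To gauge how much a given NAT gateway actually costs you, check its processed bytes in CloudWatch -- a sketch with a placeholder gateway ID; multiply the resulting byte total by $0.045/GB to estimate the processing charge endpoints would eliminate:

bash
# Bytes sent through one NAT gateway over the last 7 days (GNU date syntax)
aws cloudwatch get-metric-statistics \
  --namespace AWS/NATGateway \
  --metric-name BytesOutToDestination \
  --dimensions Name=NatGatewayId,Value=nat-0123456789abcdef0 \
  --start-time "$(date -d '7 days ago' --utc +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date --utc +%Y-%m-%dT%H:%M:%SZ)" \
  --period 86400 \
  --statistics Sum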

Tools for visibility

You can't optimize what you can't see. These three tools give you the visibility you need:

AWS Cost Explorer is your starting point. Enable hourly granularity and tag all resources consistently. Group by service, tag, or linked account to understand where money is going. Set up daily cost anomaly detection to catch unexpected spikes before they become expensive.
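
Setting up anomaly detection is a one-time task. A minimal sketch via the Cost Explorer CLI -- the monitor name and date range are placeholders:

bash
# Create a monitor that watches every AWS service for cost anomalies
aws ce create-anomaly-monitor \
  --anomaly-monitor '{"MonitorName": "service-monitor", "MonitorType": "DIMENSIONAL", "MonitorDimension": "SERVICE"}'

# Review anomalies detected in a given window
aws ce get-anomalies \
  --date-interval StartDate=2026-02-01,EndDate=2026-03-01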

Kubecost is essential if you're running Kubernetes. It breaks down costs per namespace, deployment, and pod -- something AWS billing can't do natively. Most teams running K8s discover that 2-3 deployments account for 60-70% of their cluster costs.
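
Getting started is a single Helm chart -- a sketch assuming a standard Helm setup; the kubecost namespace is a convention, not a requirement:

bash
# Install Kubecost from its official chart repository
helm repo add kubecost https://kubecost.github.io/cost-analyzer/
helm repo update
helm install kubecost kubecost/cost-analyzer \
  --namespace kubecost --create-namespace

# The dashboard runs in-cluster; port-forward to browse it locally
kubectl port-forward -n kubecost deployment/kubecost-cost-analyzer 9090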

Infracost integrates into your Terraform pull request workflow and shows the cost impact of infrastructure changes before they're applied. A developer adding a new RDS instance sees "+$180/month" right in the PR -- which naturally encourages cost-conscious decisions.

yaml
# .github/workflows/infracost.yml
name: Infracost
on: [pull_request]

# The comment step needs permission to write PR comments
permissions:
  contents: read
  pull-requests: write

jobs:
  infracost:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Infracost
        uses: infracost/actions/setup@v3
        with:
          api-key: ${{ secrets.INFRACOST_API_KEY }}

      - name: Generate cost estimate
        run: |
          infracost breakdown --path=terraform/ \
            --format=json --out-file=/tmp/infracost.json

      - name: Post PR comment
        run: |
          infracost comment github \
            --path=/tmp/infracost.json \
            --repo=$GITHUB_REPOSITORY \
            --pull-request=${{ github.event.pull_request.number }} \
            --github-token=${{ github.token }} \
            --behavior=update

Case study: Northvex -- from $85K to $49K per month

One of our clients, Northvex (a Waterloo-based B2B SaaS company), came to us with an AWS bill that had tripled over 12 months to $85K/month. Nobody on the team could explain why. Here's what we found and fixed during a 14-week engagement:

Right-sizing was the biggest win. Their production instances were running at 8-12% average CPU utilization. We right-sized across the board and migrated to Graviton instances, cutting compute costs by roughly 35%.

Reserved capacity was the second lever. They had zero commitment-based pricing -- every instance was on-demand. We analyzed 90 days of usage, purchased Compute Savings Plans for their baseline, and saved an additional 25% on steady-state compute.

Storage optimization was the quiet win. Three years of application logs in S3 Standard. No lifecycle policies anywhere. We implemented Intelligent-Tiering, moved cold data to Glacier, and added lifecycle policies that automatically clean up after 365 days. Storage costs dropped by 60%.

Architecture changes handled the rest. We moved to ECS Fargate for auto-scaling during peak loads (instead of permanently over-provisioned EC2 instances), implemented CloudFront for static assets and API caching, and added VPC endpoints to eliminate NAT gateway charges for S3 traffic.

Result: $85K/month down to $49K/month -- a 42% reduction with no loss of performance. In fact, peak load response times improved 3x thanks to the auto-scaling and caching improvements. The full case study is available on our case studies page.

A monthly cost review checklist

Run through this checklist on the first Monday of every month (the Cost Explorer and Savings Plan checks are scripted in the sketch after the list):

  • Review Cost Explorer for unexpected increases by service
  • Check Compute Optimizer for new right-sizing recommendations
  • Run the idle resource cleanup commands (EBS, EIPs, snapshots)
  • Review Savings Plan utilization -- are you using what you committed to?
  • Check S3 storage growth -- are lifecycle policies keeping up?
  • Review data transfer line items for unexpected growth
  • Update resource tags for any new infrastructure (accurate tagging = accurate cost allocation)
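
A sketch for the Cost Explorer and Savings Plan utilization checks, with hypothetical dates -- swap in the current month:

bash
# Month-over-month spend by service
aws ce get-cost-and-usage \
  --time-period Start=2026-02-01,End=2026-03-01 \
  --granularity MONTHLY \
  --metrics UnblendedCost \
  --group-by Type=DIMENSION,Key=SERVICE

# Savings Plans utilization for the same window
aws ce get-savings-plans-utilization \
  --time-period Start=2026-02-01,End=2026-03-01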

The bottom line

Cloud cost optimization isn't a one-time project -- it's a monthly practice. The five levers above (right-sizing, cleanup, storage optimization, reserved capacity, and data transfer) cover the vast majority of cloud waste. Start with right-sizing and cleanup -- they require no commitments and deliver immediate savings.

The companies that manage cloud costs well aren't spending less on cloud. They're spending the right amount -- and they can explain every dollar.

Need help implementing this?

Our team can help you put these practices into action.

Get in touch
