DevOps Cost Optimization Checklist: 35 Checks Across Compute, Storage, and Observability

Cloud bills don't explode overnight. They accumulate through dozens of small, individually invisible decisions: a Kubernetes pod with no resource limits, a CloudWatch log group retaining data forever, a CI runner spinning on a c5.4xlarge to compile a 300-line Go service. Compounded, they can produce a bill that doubles every 18 months with no corresponding doubling in value.

This checklist is the result of FinOps reviews across multiple production environments. It's organized into seven domains, each with actionable checks you can start this week. Target savings are included for most domains, based on real-world benchmarks.

FinOps Principle: Cost optimization is not a one-time project — it's an engineering discipline. The teams that win are the ones that make cost visibility a first-class requirement alongside reliability and performance.

🖥️ 1. Compute & Kubernetes (Target: 30–50% savings)

Compute is almost always the largest line item. Most environments run at 15–30% average CPU utilization on standard instances.

✅ Set resource requests AND limits on every Kubernetes container. Without requests, the scheduler cannot bin-pack nodes efficiently. Without memory limits, one noisy neighbour can exhaust node memory and get other pods OOM-killed or evicted. Both lead to over-provisioning.
✅ Enable the Kubernetes Vertical Pod Autoscaler (VPA) in recommendation mode (updateMode: "Off"). Run it for at least 7 days and use its lowerBound / target / upperBound recommendations when updating manifests.
✅ Migrate stateless workloads to Spot / Preemptible instances. AWS Spot Instances save 60–90% vs On-Demand. Use a mixed instance policy in your ASG or Karpenter node pool.
✅ Purchase Savings Plans or Reserved Instances for baseline compute. Commit only to your steady-state floor (p10 of your hourly usage). Use Spot above that.
✅ Enable Cluster Autoscaler or Karpenter. Scale node groups down to zero during off-peak. A dev cluster running overnight with zero pods still costs money.
✅ Schedule non-production environments off outside business hours. Implement a Lambda + EventBridge rule to stop EC2 and scale ECS/EKS node groups to zero at 8 PM and restart at 8 AM.
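
As a rough check on what that schedule is worth, the sketch below computes the share of weekly instance-hours removed by an 8 AM to 8 PM running window. The function name is hypothetical, and keeping environments off all weekend is an added assumption, not something the checklist specifies.

```python
HOURS_PER_WEEK = 24 * 7


def off_hours_saving(start_hour: int, stop_hour: int, days_on: int) -> float:
    """Fraction of weekly instance-hours saved by running only inside the window."""
    if not (0 <= start_hour < stop_hour <= 24 and 0 <= days_on <= 7):
        raise ValueError("expected 0 <= start < stop <= 24 and 0 <= days_on <= 7")
    running_hours = (stop_hour - start_hour) * days_on
    return 1 - running_hours / HOURS_PER_WEEK


# 8 AM to 8 PM, seven days a week.
every_day = off_hours_saving(8, 20, days_on=7)
# Assumption: environments also stay off at weekends.
weekdays_only = off_hours_saving(8, 20, days_on=5)
print(f"daily schedule: {every_day:.0%} saved, weekdays only: {weekdays_only:.0%} saved")
```

This covers compute hours only. Attached EBS volumes and other storage keep billing while instances are stopped.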

karpenter-nodepool.yaml — Spot + On-Demand mixed

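# Requires Karpenter v1+. The older v1beta1 API rejects consolidateAfter
# when combined with consolidationPolicy: WhenUnderutilized.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general
spec:
  template:
    spec:
      nodeClassRef:               # assumes an EC2NodeClass named "default" exists
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]   # prefer Spot, fallback to OD
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - m5.large
            - m5.xlarge
            - m6i.large
            - m6i.xlarge
            - c5.large
            - c6i.large
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s   # remove empty or underutilized nodes quickly
  limits:
    cpu: 1000
    memory: 2000Gi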
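
One way to find the steady-state floor mentioned in the Savings Plans check is a nearest-rank percentile over hourly usage. The sketch below is illustrative only: `commitment_floor` and the sample figures are hypothetical, and real input would come from your own billing export.

```python
import math


def commitment_floor(hourly_usage: list[float], percentile: float = 10.0) -> float:
    """Nearest-rank percentile of hourly usage (p10 by default)."""
    if not hourly_usage:
        raise ValueError("need at least one hourly sample")
    ordered = sorted(hourly_usage)
    # Nearest-rank: the smallest value with at least `percentile`% of samples at or below it.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical vCPUs in use during each of 20 sampled hours.
usage = [40, 42, 41, 95, 120, 110, 60, 58, 44, 43,
         41, 40, 39, 130, 125, 90, 70, 65, 45, 42]
floor = commitment_floor(usage)
print(f"commit to about {floor} vCPUs; cover the rest with Spot or On-Demand")
```

Committing at a low percentile keeps the commitment fully used in nearly every hour, which is why the checklist suggests p10 instead of the average.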

💾 2. Storage (Target: 20–40% savings)

✅ Audit and delete orphaned EBS volumes. Volumes in available state are detached and still billed. Find them with a single AWS CLI command.
✅ Set S3 lifecycle rules on every bucket. Move objects to Intelligent-Tiering or Glacier after 30–90 days. Delete incomplete multipart uploads (a silent cost many teams miss).
✅ Move rarely accessed snapshots to the EBS Snapshots Archive tier; delete the rest. gp3 is a volume type and does not apply to snapshots. EBS snapshots older than 90 days are rarely needed. Build a Lambda that enforces a retention window.
✅ Use gp3 over gp2 for EBS volumes. gp3 is 20% cheaper and lets you provision IOPS/throughput independently, so you don't need to over-size the volume just for performance.
✅ Compress and deduplicate CloudWatch Logs before shipping to S3. Use subscription filters to stream logs to Kinesis Firehose → S3 + Athena for long-term querying at a fraction of CloudWatch storage cost.

find-orphaned-ebs.sh

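#!/usr/bin/env bash
# List all detached EBS volumes and estimate their monthly cost
set -euo pipefail

aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[*].{ID:VolumeId, Size:Size, Type:VolumeType, AZ:AvailabilityZone}' \
  --output table

# Sum the sizes (GiB) of all detached volumes
total_gb=$(aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'sum(Volumes[].Size)' \
  --output json)

# Rough estimate at the gp3 rate of $0.08/GB-month; other volume types are priced differently
awk -v gb="$total_gb" 'BEGIN { printf "Estimated cost: $%.2f/month\n", gb * 0.08 }'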
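
If the detached volumes mix types, a per-type estimate is more accurate than a single gp3 rate. The sketch below assumes records shaped like the `Volumes` list returned by `aws ec2 describe-volumes`; the price table holds assumed us-east-1 list prices and the function name is hypothetical, so check both against current pricing for your region.

```python
# Assumed us-east-1 list prices in USD per GB-month; verify before relying on them.
PRICE_PER_GB_MONTH = {"gp3": 0.08, "gp2": 0.10}


def orphaned_volume_cost(volumes: list[dict]) -> float:
    """Sum the monthly storage cost of volumes whose State is 'available'."""
    total = 0.0
    for vol in volumes:
        if vol.get("State") != "available":
            continue  # attached volumes are in use, skip them
        price = PRICE_PER_GB_MONTH.get(vol["VolumeType"])
        if price is None:
            raise KeyError(f"no price configured for {vol['VolumeType']}")
        total += vol["Size"] * price
    return round(total, 2)


# Hypothetical sample data.
sample = [
    {"VolumeId": "vol-aaa", "Size": 100, "VolumeType": "gp3", "State": "available"},
    {"VolumeId": "vol-bbb", "Size": 500, "VolumeType": "gp2", "State": "available"},
    {"VolumeId": "vol-ccc", "Size": 200, "VolumeType": "gp3", "State": "in-use"},
]
print(f"${orphaned_volume_cost(sample):.2f} per month in detached volumes")
```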

📡 3. Data Transfer & Networking (Target: 15–30% savings)

Data transfer is one of the most underestimated cost categories. Traffic leaving your VPC to the internet is billed at roughly $0.09/GB in many AWS regions (the first-10-TB tier in us-east-1). Cross-AZ traffic within your own account costs $0.01/GB in each direction, so a gigabyte that crosses zones effectively costs $0.02.

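✅ Keep chatty services in the same AZ. topologySpreadConstraints only controls where pods are placed; to keep traffic within an AZ, enable Topology Aware Routing (the service.kubernetes.io/topology-mode: Auto annotation) or set trafficDistribution: PreferClose on Services (beta in K8s 1.31+).
✅ Use VPC Endpoints for AWS service traffic. Traffic from private subnets to S3, DynamoDB, ECR, or SSM that leaves through a NAT Gateway pays the per-GB processing charge and reaches public endpoints, which is also a security concern. VPC Gateway Endpoints for S3 and DynamoDB are free; Interface Endpoints (ECR, SSM) carry an hourly and per-GB charge, so compare them against your NAT bill.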
✅ Audit your NAT Gateway cost. NAT Gateway charges $0.045/GB processed in addition to hourly cost. If your workloads are downloading large artifacts (Docker images, npm packages) via NAT, consider ECR pull-through cache or self-hosted artifact registries inside the VPC.
✅ Use CloudFront for egress-heavy workloads. CloudFront-to-S3 origin transfer is free. CloudFront egress pricing tiers down significantly at scale vs direct S3 egress.
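✅ Release unused Elastic IP addresses. Unattached EIPs cost $0.005/hour = ~$3.60/month each, and since February 2024 AWS charges the same rate for every public IPv4 address, attached or not. Find the idle ones and release them.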
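
To put numbers on the NAT Gateway and cross-AZ checks, the sketch below turns the per-GB rates quoted in this section into monthly estimates. The $0.045 hourly NAT rate and the 730-hour month are assumptions (us-east-1 list price), and both function names are hypothetical.

```python
HOURS_PER_MONTH = 730  # average month length, an estimating convention


def nat_gateway_monthly_cost(gb_processed: float,
                             per_gb: float = 0.045,
                             per_hour: float = 0.045) -> float:
    """Hourly charge plus per-GB processing charge for one NAT Gateway."""
    return round(HOURS_PER_MONTH * per_hour + gb_processed * per_gb, 2)


def cross_az_monthly_cost(gb_transferred: float, per_gb_each_way: float = 0.01) -> float:
    """Cross-AZ traffic is charged on both the sending and the receiving side."""
    return round(gb_transferred * per_gb_each_way * 2, 2)


# Hypothetical workload: 10 TB of S3 traffic routed through NAT, 5 TB crossing AZs.
print(f"NAT Gateway: ${nat_gateway_monthly_cost(10_000):.2f}/month")
print(f"Cross-AZ:    ${cross_az_monthly_cost(5_000):.2f}/month")
```

Routing the same S3 traffic through a Gateway Endpoint removes the per-GB portion of the NAT figure, which is the larger part in this example.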

📊 4. Observability & Logging (Target: 20–35% savings)

Observability infrastructure is one of the fastest-growing cost centres as teams scale. CloudWatch, Datadog, Grafana Cloud, and similar tools bill by volume — logs, metrics, and spans.

✅ Set CloudWatch Log Group retention policies. Default retention is Never Expire. Set a maximum of 90 days for application logs; 1 year for audit/security logs. Archive to S3 for longer retention.
✅ Drop noisy, low-value logs at the source. Use a Fluent Bit grep or rewrite_tag filter to drop health check logs (GET /healthz 200) before they hit CloudWatch or Loki. These can account for 30–60% of total log volume.
✅ Lengthen scrape intervals for metrics that don't need high resolution. Scraping 500 services every second with Prometheus is expensive. Evaluate which metrics genuinely need sub-minute granularity and scrape the rest at 60s.
✅ Sample distributed traces instead of storing all of them. Storing 100% of traces in Jaeger, Tempo, or AWS X-Ray is almost never needed. A common setup keeps 1–5% of traces through head-based sampling and adds a tail-based rule that retains every trace containing an error, since a head-based sampler decides before it knows whether a request will fail.
✅ Audit custom metric dimensions. In CloudWatch, each unique metric dimension combination is a separate billable metric. A single metric with 5 high-cardinality dimensions can produce millions of billable metric streams.
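
The trace sampling check can be sketched as a decision made once a trace has finished: keep everything that failed, and keep a fixed fraction of the rest. This is a minimal illustration and not the API of any particular collector; `keep_trace` and the trace IDs are hypothetical.

```python
import hashlib


def keep_trace(trace_id: str, has_error: bool, sample_rate: float = 0.05) -> bool:
    """Decide after a trace completes whether to store it.

    Error traces are always kept. Others are kept at `sample_rate`, using a
    hash of the trace ID so every collector makes the same decision.
    """
    if has_error:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < sample_rate * 2**64


# Hypothetical trace IDs: 10,000 healthy traces and one failing trace.
healthy = [f"trace-{i}" for i in range(10_000)]
kept = sum(keep_trace(t, has_error=False) for t in healthy)
print(f"kept {kept} of {len(healthy)} healthy traces")
print("failed trace kept:", keep_trace("trace-failed", has_error=True))
```

Hashing the trace ID instead of calling a random number generator keeps the decision consistent when several collectors see spans from the same trace.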

fluent-bit — drop healthcheck logs

[FILTER]
    Name    grep
    Match   app.*
    Exclude log  GET /healthz
    Exclude log  GET /readyz
    Exclude log  GET /metrics
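
Before rolling out a filter like this, it can help to estimate how much volume it would remove. The sketch below applies the same three patterns to a sample of log lines and reports the share of characters that would be dropped; the function name and sample lines are hypothetical.

```python
import re

# Same patterns as the Exclude rules in the grep filter above.
EXCLUDE_PATTERNS = [re.compile(p) for p in (r"GET /healthz", r"GET /readyz", r"GET /metrics")]


def dropped_share(lines: list[str]) -> float:
    """Fraction of log characters on lines matching at least one exclude pattern."""
    total = sum(len(line) for line in lines)
    if total == 0:
        return 0.0
    dropped = sum(len(line) for line in lines
                  if any(p.search(line) for p in EXCLUDE_PATTERNS))
    return dropped / total


# Hypothetical sample of access-log lines.
sample_lines = [
    "GET /healthz 200",
    "POST /orders 201",
    "GET /readyz 200",
    "GET /orders/42 200",
]
print(f"{dropped_share(sample_lines):.0%} of sampled log volume would be dropped")
```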

🗄️ 5. Database & Cache (Target: 25–45% savings)

✅ Rightsize RDS and Aurora instances. Use AWS Compute Optimizer or RDS Performance Insights to find instances running at under 10% average CPU. Downgrade instance class or switch to Serverless v2 for variable workloads.
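✅ Consider a read replica architecture instead of Multi-AZ for read-heavy workloads. A classic Multi-AZ standby doubles your instance cost for HA but serves no traffic. A single primary + read replica costs about the same and lets the second instance absorb reads, which can avoid scaling up the primary. Promotion of a replica is manual, so keep Multi-AZ wherever automatic failover is a requirement.
✅ Evaluate Aurora Serverless v2 for dev/staging databases. Aurora Serverless v2 bills per second and, on engine versions that support automatic pause, can scale to 0 ACUs when idle; older versions bottom out at 0.5 ACU. A good fit for environments that aren't used 24/7.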
✅ Reduce ElastiCache node sizes and enable data tiering. ElastiCache data tiering (SSD-backed) on r6gd nodes can cut in-memory costs by 60% for workloads with mixed hot/cold data access patterns.
✅ Set DynamoDB table billing mode to PAY_PER_REQUEST for low-traffic tables. Provisioned capacity with auto-scaling often over-provisions during quiet periods. On-demand billing is simpler and cheaper for tables with spiky or low traffic.
✅ Audit RDS automated backup retention windows. A 35-day backup window on a 500 GB database creates significant snapshot storage cost. Reduce to 7–14 days and use manual snapshots for long-term compliance.
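
The rightsizing check reduces to a simple filter once CPU data is exported. The sketch below flags instances whose average CPU sits under the 10% threshold mentioned above; the function name, instance names, and figures are hypothetical, and in practice the samples would come from CloudWatch's CPUUtilization metric.

```python
from statistics import mean


def rightsizing_candidates(cpu_by_instance: dict[str, list[float]],
                           threshold_pct: float = 10.0) -> list[str]:
    """Return instance identifiers whose average CPU utilization is below the threshold."""
    return sorted(
        name for name, samples in cpu_by_instance.items()
        if samples and mean(samples) < threshold_pct
    )


# Hypothetical daily average CPU utilization (percent) per DB instance.
metrics = {
    "orders-db": [4.1, 5.0, 3.8, 6.2],
    "search-db": [55.0, 61.3, 48.9, 70.2],
    "reports-db": [9.0, 12.0, 8.5, 9.5],
}
print(rightsizing_candidates(metrics))
```

Average CPU is only a first filter. Check memory pressure, connection counts, and peak load before moving an instance to a smaller class.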

🏷️ 6. FinOps: Tagging & Visibility (Foundation for everything else)

You cannot optimize what you cannot attribute. Without a consistent tagging strategy, your cost explorer shows a wall of unallocated spend and no team can be held accountable for their resource usage.

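✅ Enforce mandatory tags with AWS Organizations tag policies and an SCP. Require Environment, Team, Service, and CostCenter tags on all taggable resources. Tag policies standardize keys and allowed values; to reject resource creation that omits a tag, pair them with an SCP that denies requests where the matching aws:RequestTag condition key is missing.
✅ Activate your tags as cost allocation tags in the Billing console. Activated tags can take up to 24 hours to appear in Cost Explorer and are not applied retroactively. Do this now.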
✅ Set up Cost Anomaly Detection. AWS Cost Anomaly Detection uses ML to flag unexpected spend spikes. Configure monitors per service and per linked account with an SNS alert to your #finops Slack channel.
✅ Create per-team AWS Budgets with alert thresholds at 80% and 100%. Budget alerts via SNS → Slack or email keep engineers aware of spend before it becomes a month-end surprise.

tag-policy.json — AWS Organizations

{
  "tags": {
    "Environment": {
      "tag_key": { "@@assign": "Environment" },
      "tag_value": {
        "@@assign": ["dev", "staging", "production"]
      },
      "enforced_for": {
        "@@assign": ["ec2:instance", "rds:db", "s3:bucket", "eks:cluster"]
      }
    },
    "Team": {
      "tag_key": { "@@assign": "Team" },
      "enforced_for": {
        "@@assign": ["ec2:instance", "rds:db", "lambda:function"]
      }
    },
    "Service": {
      "tag_key": { "@@assign": "Service" },
      "enforced_for": {
        "@@assign": ["ec2:instance", "rds:db", "lambda:function"]
      }
    },
    "CostCenter": {
      "tag_key": { "@@assign": "CostCenter" },
      "enforced_for": {
        "@@assign": ["ec2:instance", "rds:db"]
      }
    }
  }
}
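
A tag policy is easier to roll out if you first measure how far existing resources are from compliance. The sketch below checks one resource's tags against the required keys and allowed Environment values from this section; `tag_violations` is a hypothetical helper, and the tag dictionaries would come from whatever inventory you already export.

```python
REQUIRED_TAGS = ("Environment", "Team", "Service", "CostCenter")
ALLOWED_ENVIRONMENTS = {"dev", "staging", "production"}


def tag_violations(tags: dict[str, str]) -> list[str]:
    """Return human-readable problems with one resource's tags."""
    problems = [f"missing tag: {key}" for key in REQUIRED_TAGS if not tags.get(key)]
    env = tags.get("Environment")
    if env and env not in ALLOWED_ENVIRONMENTS:
        problems.append(f"invalid Environment value: {env}")
    return problems


# Hypothetical resource tagged before the policy existed.
legacy = {"Environment": "prod", "Team": "payments"}
for problem in tag_violations(legacy):
    print(problem)
```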

⚙️ 7. CI/CD Pipeline Costs (Target: 20–40% savings)

✅ Right-size CI runner instance types. Most build jobs are I/O-bound, not CPU-bound. A c5.large (2 vCPU, 4 GB) often runs builds as fast as a c5.4xlarge for typical application code. Benchmark your build times vs instance class.
✅ Cache dependencies aggressively. In GitHub Actions, use actions/cache for node_modules, .gradle, pip, Maven, and Docker layer caches. Uncached builds that re-download 2 GB of npm packages waste both time and runner minutes.
✅ Cancel redundant workflow runs on push. When a developer pushes 3 commits in quick succession, only the latest matters. Use concurrency groups in GitHub Actions to auto-cancel in-progress runs.
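✅ Run lint and unit tests in parallel, not sequentially. Splitting a sequential 20-minute pipeline into parallel 7-minute jobs cuts developer feedback time by roughly two thirds. Billed runner minutes stay about the same, so the cost saving comes from failing fast: when jobs are configured to stop on the first failure, a lint error ends the run before the slower jobs finish.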

GitHub Actions — concurrency + cache

name: CI

on:
  push:
    branches: [main, 'feature/**']

# Cancel in-progress runs for the same branch
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Cache Node modules
        uses: actions/cache@v4
        with:
          path: ~/.npm
          key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}
          restore-keys: |
            ${{ runner.os }}-node-

      - run: npm ci
      - run: npm test
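
To see how parallel jobs affect billed minutes and wall-clock time separately, the sketch below compares a single 20-minute job with three 7-minute jobs. It assumes each job is billed rounded up to a whole minute, as GitHub-hosted runners do; the function names and durations are hypothetical.

```python
import math


def billed_minutes(job_durations_min: list[float]) -> int:
    """Total billed minutes, with each job rounded up to a whole minute."""
    return sum(math.ceil(d) for d in job_durations_min)


def wall_clock_minutes(job_durations_min: list[float], parallel: bool) -> float:
    """Time until the last job finishes."""
    return max(job_durations_min) if parallel else sum(job_durations_min)


# Hypothetical split: each parallel job repeats some checkout and setup time.
sequential_jobs = [20.0]
parallel_jobs = [7.0, 7.0, 7.0]

print("sequential:", billed_minutes(sequential_jobs), "billed,",
      wall_clock_minutes(sequential_jobs, parallel=False), "wall-clock")
print("parallel:  ", billed_minutes(parallel_jobs), "billed,",
      wall_clock_minutes(parallel_jobs, parallel=True), "wall-clock")
```

Dependency caching and cancelling superseded runs reduce billed minutes directly, while parallelism mainly buys faster feedback.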

Quick-Win Priority Matrix

priority-matrix

Action                                  │ Effort │ Typical Saving
────────────────────────────────────────┼────────┼────────────────
Release unattached EIPs                 │  Low   │ $3–$50/mo
Delete orphaned EBS volumes             │  Low   │ $20–$500/mo
Set CW Log Group retention (90 days)    │  Low   │ $50–$800/mo
Enable Spot for dev/staging EC2         │  Low   │ 60–90% compute
Drop healthcheck logs in FluentBit      │  Low   │ 20–40% log cost
gp2 → gp3 EBS migration                 │  Med   │ 20% storage
Add S3 lifecycle rules                  │  Med   │ 30–60% S3 cost
VPC Endpoints for S3/DynamoDB           │  Med   │ $100–$2k/mo
Kubernetes VPA + Karpenter              │  High  │ 30–50% compute
Savings Plans / Reserved Instances      │  High  │ 30–40% baseline

Conclusion

A structured FinOps practice doesn't require a dedicated team or expensive tooling. Start with the low-effort, high-return wins in the matrix above — orphaned EIPs, EBS volumes, and log retention policies. Each of these can be completed in under an hour and often saves hundreds of dollars immediately.

Then build the discipline: enforce tags, configure anomaly detection, and review Cost Explorer weekly as part of your team's sprint rituals. Cost optimization is an engineering problem, and like any engineering problem, it rewards systematic thinking over heroic one-time cleanups.