Case study · FinOps

Reducing AWS Infrastructure Costs by 72% Through Cloud Migration and Spot Strategy

↓ 72% cost · $135K → $35K / month · AdTech · RTB platform · 400 servers migrated · AWS · spot instances

A real-time bidding (RTB) platform processing continuous high-throughput auction traffic was paying $135,000 per month for managed hosting. Twelve months later, after a phased migration to AWS, an aggressive spot-instance strategy and a rework of the workload's elasticity assumptions, the same platform was running on $35,000 per month — a 72% reduction with no measurable impact on bid latency or fill rate.

The problem

The platform ran on a managed-hosting provider with a mix of bare-metal servers and a private virtualisation layer. Roughly 400 servers in total: ad-server fleet, RTB bidders, a Hadoop cluster for analytics, Kafka for event streams, Druid for OLAP queries, MySQL clusters, MongoDB clusters, and tracker servers handling impression/click pings.

Three things had quietly broken the cost model:

The constraints

Three things shaped the migration plan:

The architecture decision: split the workload by tier

The single most impactful decision was splitting the platform into three workload tiers, each with its own pricing and scheduling model:

Tier 1: Latency-critical (RTB bidders, ad servers)
  Compute:    On-demand EC2 (small headroom) + Reserved Instances for baseline
  Scheduling: Auto-scaling group with health checks
  Why:        Sub-100ms requirement; spot interruption cost > spot saving for this tier

Tier 2: Variable batch (Hadoop, Druid loaders, log pipelines)
  Compute:    EC2 spot fleet with diversified pools
  Scheduling: Job-queue retries; checkpoint to S3
  Why:        Latency-indifferent; spot saved 60–80%

Tier 3: Storage and offline analytics
  Compute:    Spot + S3 + lifecycle policies
  Scheduling: Daily batch windows
  Why:        Cold data was being treated as hot
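The tier split can be read as a small decision rule. The sketch below is illustrative only: the `Workload` fields, the 100 ms default and the fallback for non-resumable jobs are my assumptions for the example, not the rules the team actually encoded.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Tier(Enum):
    LATENCY_CRITICAL = 1  # on-demand + reserved capacity
    VARIABLE_BATCH = 2    # spot fleet, job-queue retries, checkpoints
    STORAGE_OFFLINE = 3   # spot + S3 lifecycle, daily batch windows


@dataclass(frozen=True)
class Workload:
    # All fields are hypothetical descriptors chosen for this sketch.
    name: str
    latency_budget_ms: Optional[float]  # None means no latency requirement
    resumable: bool                     # can restart from a checkpoint
    mostly_cold_data: bool              # dominated by rarely-read data


def assign_tier(w: Workload, latency_threshold_ms: float = 100.0) -> Tier:
    """Pick a pricing/scheduling tier for a workload."""
    if w.latency_budget_ms is not None and w.latency_budget_ms <= latency_threshold_ms:
        return Tier.LATENCY_CRITICAL
    if w.mostly_cold_data:
        return Tier.STORAGE_OFFLINE
    if w.resumable:
        return Tier.VARIABLE_BATCH
    # Not latency-bound, but cannot survive an interruption: keep it off spot.
    return Tier.LATENCY_CRITICAL
```

For example, `assign_tier(Workload("hadoop-job", None, True, False))` lands in `Tier.VARIABLE_BATCH`, while a bidder with an 80 ms budget stays in `Tier.LATENCY_CRITICAL`.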

Tier 1 stayed on on-demand and reserved capacity, about 30% of total compute, because the bid path's latency budget made spot interruptions too expensive. Tier 2 moved entirely to spot, with diversified instance pools to reduce simultaneous-interruption risk. Tier 3 (analytics, log retention, replication) was the largest source of waste and the easiest to rework: most data was tiered to S3, lifecycle policies pruned cold data automatically, and batch jobs ran in scheduled windows on spot.
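To make the Tier 3 lifecycle idea concrete, here is a minimal sketch of one lifecycle rule in the dictionary shape that boto3's `put_bucket_lifecycle_configuration` accepts as `LifecycleConfiguration`. The rule ID, prefix and day counts are invented for illustration; the real retention windows are not stated in this case study. The helper `storage_class_at` is a hypothetical local function that only simulates where an object would sit at a given age.

```python
# Hypothetical rule: prefix and day thresholds are assumptions, not the real policy.
LIFECYCLE = {
    "Rules": [
        {
            "ID": "tier-down-event-logs",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},
        }
    ]
}


def storage_class_at(rule: dict, age_days: int):
    """Return the storage class an object would be in at this age,
    or None once the rule has expired (deleted) it."""
    expiry = rule.get("Expiration", {}).get("Days")
    if expiry is not None and age_days >= expiry:
        return None
    current = "STANDARD"
    for transition in sorted(rule.get("Transitions", []), key=lambda t: t["Days"]):
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current


# Applying it would look roughly like this (not executed here):
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-bucket", LifecycleConfiguration=LIFECYCLE)
```

Simulating the rule before applying it is a cheap way to check that "cold" really means what the team thinks it means.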

Implementation notes

The migration ran live behind a fronting load balancer

Each component was migrated by spinning up an AWS-side replica, joining it to the production load balancer at a low traffic weight, ramping the weight over a week of observation, then decommissioning the legacy node. Any AWS-side issue meant rolling back a load-balancer weight, not rolling back a deploy. The team could move fast because the rollback mechanism was instant.
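The ramp-and-rollback loop is simple enough to sketch. This is an assumed shape, not the tooling we ran: `set_weight` and `healthy` are hypothetical callables standing in for the load-balancer API and the observation period, and the step values are examples.

```python
from typing import Callable, Iterable


def ramp_weights(
    steps: Iterable[int],
    set_weight: Callable[[int], None],
    healthy: Callable[[], bool],
) -> int:
    """Walk the AWS-side traffic weight through `steps`.

    After each change, `healthy()` is consulted (in practice: after an
    observation window). On the first unhealthy result the weight is set
    back to the last known-good value. Returns the final weight.
    """
    previous = 0
    for weight in steps:
        set_weight(weight)
        if not healthy():
            set_weight(previous)  # rollback is a weight change, not a redeploy
            return previous
        previous = weight
    return previous
```

With steps such as `[5, 25, 50, 100]`, a failure observed at 25% leaves the replica serving 5% while the issue is investigated.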

Spot interruption handling, properly

The biggest spot mistake people make is treating an interruption as an exception. Treat it as the normal case. Every Tier 2 worker received the spot interruption notice (the two-minute warning) as an event, gracefully drained its work, checkpointed to S3, and exited. The auto-scaling group brought up a replacement from a different instance pool. The job queue retried. No human involved.
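A minimal sketch of that worker behaviour follows. It assumes the instance metadata endpoint `spot/instance-action` (which returns 404 until an interruption is scheduled) and uses the token-based IMDSv2 flow, which is how you would do it today rather than what existed at the time. `process` and `checkpoint` are hypothetical callables; in our case the checkpoint went to S3.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Iterable, Optional

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout: float = 1.0) -> Optional[dict]:
    """Return the pending spot instance-action document, or None if none is scheduled.
    Only meaningful when run on an EC2 instance."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    try:
        with urllib.request.urlopen(token_req, timeout=timeout) as resp:
            token = resp.read().decode()
        req = urllib.request.Request(
            f"{IMDS}/meta-data/spot/instance-action",
            headers={"X-aws-ec2-metadata-token": token},
        )
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return json.loads(resp.read())
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption scheduled
            return None
        raise


def run_worker(
    items: Iterable,
    process: Callable[[object], None],
    checkpoint: Callable[[int], None],
    interrupted: Callable[[], bool],
) -> int:
    """Process items until done or interrupted, then checkpoint progress.
    Returns the number of items completed."""
    done = 0
    for item in items:
        if interrupted():  # a real worker might poll on a timer instead of per item
            break
        process(item)
        done += 1
    checkpoint(done)  # runs on both normal completion and interruption
    return done
```

On an instance, `interrupted` could be `lambda: fetch_instance_action() is not None`. The point of the shape is that the interrupted path and the normal path end the same way: checkpoint, exit, let the queue retry.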

Spot diversification across instance families and AZs meant the probability of simultaneous interruption across the whole fleet was effectively zero, even during AWS pricing spikes.
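As a rough illustration of why diversification helps, the sketch below multiplies per-pool interruption probabilities. That is only valid if pools are independent, which real spot pools are not (a capacity event in one AZ can hit several families at once), so treat the result as an optimistic bound. The 5% figure and pool names are made up for the example.

```python
import math
from typing import Iterable, Mapping


def p_all_interrupted(pool_probs: Iterable[float]) -> float:
    """Probability that every pool is interrupted in the same window,
    ASSUMING pools fail independently (an idealisation)."""
    return math.prod(pool_probs)


def largest_pool_share(capacity_by_pool: Mapping[str, int]) -> float:
    """Fraction of fleet capacity lost if the single largest pool disappears."""
    total = sum(capacity_by_pool.values())
    return max(capacity_by_pool.values()) / total


# Hypothetical numbers: six pools, each with a 5% chance of interruption per window.
p = p_all_interrupted([0.05] * 6)

# Hypothetical fleet layout (instance family / AZ).
share = largest_pool_share({"family-a/az-1": 40, "family-b/az-2": 30, "family-c/az-3": 30})
```

The second function is arguably the more useful planning number: it says how much capacity the job queue must be able to lose at once and still make progress.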

The "DR" that nobody had tested

The legacy setup had cross-region replication configured for disaster recovery. Nobody had run a DR drill in years. We tested it as part of the migration audit and discovered the replication had been silently broken for ~18 months. That alone was a customer-facing risk that the migration surfaced and resolved — the new AWS architecture had multi-AZ resilience by default and a documented DR drill scheduled quarterly.
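The lesson generalises: replication should be asserted, not assumed. A minimal sketch of a freshness check is below, assuming you can obtain the timestamp of the newest record visible on the replica; how that timestamp is fetched, and the 15-minute threshold in the usage note, are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def replication_is_fresh(
    last_replicated_at: datetime,
    max_lag: timedelta,
    now: Optional[datetime] = None,
) -> bool:
    """True if the replica has received data within `max_lag` of `now`.
    Timestamps are expected to be timezone-aware."""
    now = now or datetime.now(timezone.utc)
    return (now - last_replicated_at) <= max_lag
```

Run on a schedule with something like `max_lag=timedelta(minutes=15)`, a check of this kind would have flagged the broken replication within minutes rather than after 18 months.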

Monitoring 400 servers without enterprise APM cost

The legacy monitoring was a patchwork of vendor agents at substantial per-host pricing. The replacement was Nagios + Bash for host-level checks and Grafana on top of a self-hosted metrics backend for trends and dashboards, a deliberate choice to avoid recreating the per-host licence cost on AWS. The team owned the monitoring stack outright; cost scaled with infrastructure, not vendor pricing.
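Host-level checks in this style are tiny programs that follow the Nagios plugin convention: print one status line and exit 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The original checks were Bash; the sketch below shows the same pattern in Python for a disk-usage check. The thresholds and the `DISK` label are example choices, not the ones we ran in production.

```python
import shutil
import sys

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(used_pct: float, warn: float = 80.0, crit: float = 90.0) -> int:
    """Map a usage percentage onto a Nagios state."""
    if used_pct >= crit:
        return CRITICAL
    if used_pct >= warn:
        return WARNING
    return OK


def check_disk(path: str = "/", warn: float = 80.0, crit: float = 90.0):
    """Return (state, message) for the filesystem containing `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as err:
        return UNKNOWN, f"DISK UNKNOWN - {err}"
    used_pct = 100.0 * usage.used / usage.total
    state = classify(used_pct, warn, crit)
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[state]
    # Text after the pipe is performance data: label=value;warn;crit
    message = (
        f"DISK {label} - {used_pct:.1f}% used"
        f" | used_pct={used_pct:.1f}%;{warn:g};{crit:g}"
    )
    return state, message


def main() -> None:
    """Entry point when installed as a plugin: print the line, exit with the state."""
    state, message = check_disk()
    print(message)
    sys.exit(state)
```

Because the contract is just "one line plus an exit code", checks like this cost nothing per host and can be written by whoever owns the service.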

What didn't work

Results

↓ 72% Monthly infra cost
↓ 30% Production downtime (architectural side-effect)
↓ 50% Deployment time (automation)

Cost breakdown — before and after

Tier                                         Before (legacy)   After (AWS)   Saving
Latency-critical (RTB, ad servers)           ~$58K             ~$18K         69%
Variable batch (Hadoop, Druid, pipelines)    ~$42K             ~$8K          81%
Storage & analytics                          ~$20K             ~$5K          75%
Networking, monitoring, misc                 ~$15K             ~$4K          73%
Total monthly                                $135K             $35K          72%

Figures rounded for clarity. Actual savings varied month-to-month within ±5% of these averages.
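For readers who want to check the arithmetic, the per-tier savings can be recomputed from the rounded figures in the table. Note that the rounded totals give roughly 74% rather than the 72% headline; my assumption is that the headline reflects the unrounded monthly actuals, which falls within the ±5% variation stated above.

```python
# (before, after) in thousands of USD per month, rounded values from the table.
COSTS_K = {
    "Latency-critical": (58, 18),
    "Variable batch": (42, 8),
    "Storage & analytics": (20, 5),
    "Networking, monitoring, misc": (15, 4),
}


def saving_pct(before: float, after: float) -> float:
    """Percentage reduction from `before` to `after`."""
    return 100.0 * (before - after) / before


per_tier = {name: round(saving_pct(b, a)) for name, (b, a) in COSTS_K.items()}
total_before = sum(b for b, _ in COSTS_K.values())  # 135
total_after = sum(a for _, a in COSTS_K.values())   # 35
total_saving = saving_pct(total_before, total_after)
```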

Beyond the headline number, three softer wins:

What I'd do differently today

This migration was a decade ago. The same outcome today would use different specific tools, but the principles are unchanged. The current playbook would be:

Tools used

AWS EC2 · ELB · S3 · Auto Scaling · RDS · Spot Fleet · Bash · Cloud-init · Hadoop · Druid · Kafka · MySQL · MongoDB · Nagios · Grafana · Linux.

Got a cloud bill that's grown faster than the workload?

I work with platform and engineering teams to find the 30–70% saving that's already in their AWS account. Outside IR35, Inside IR35, permanent or fractional engagements.
