A real-time bidding (RTB) platform processing continuous high-throughput auction traffic was paying $135,000 per month for managed hosting. Twelve months later, after a phased migration to AWS, an aggressive spot-instance strategy and a rework of the workload's elasticity assumptions, the same platform was running on $35,000 per month — a 72% reduction with no measurable impact on bid latency or fill rate.
The problem
The platform ran on a managed-hosting provider with a mix of bare-metal servers and a private virtualisation layer. Roughly 400 servers in total: ad-server fleet, RTB bidders, a Hadoop cluster for analytics, Kafka for event streams, Druid for OLAP queries, MySQL clusters, MongoDB clusters, and tracker servers handling impression/click pings.
Three things had quietly broken the cost model:
- Capacity bought for peak, paid for at peak. The platform's QPS varied by 4–5× across the day, but capacity was sized for the daily peak and billed flat. Off-peak compute was idle and expensive.
- No tier separation between latency-critical and batch workloads. The RTB bidders (sub-100ms latency budget) ran on the same class of hardware as the analytics jobs (latency-indifferent). Both got premium hosting; only one needed it.
- Storage and bandwidth that nobody was monitoring. Tracker server logs were being shipped to a NAS at full retention, indefinitely. Cross-region replication for "DR" had been configured years earlier, never reviewed, never tested.
The constraints
Three things shaped the migration plan:
- Zero downtime on the bid path. Lost bids equal lost revenue. The RTB fleet had to migrate live, with traffic running, behind a load balancer that could fail back to the legacy stack instantly.
- Bid latency budget < 100ms. Spot instances could disrupt the bid path if not handled correctly. Fail-fast and graceful drain mattered more than they would on most workloads.
- No new engineering hires. The migration had to be delivered by the existing team plus me, with no backfill budget and no consultancy multipliers.
The architecture decision: split the workload by tier
The single most impactful decision was splitting the platform into three workload tiers, each with its own pricing and scheduling model:
Tier 1 (latency-critical: RTB bidders and ad servers) stayed on on-demand and reserved capacity, about 30% of total compute, because the bid path's latency budget made spot interruptions too expensive. Tier 2 (variable batch: Hadoop, Druid and data pipelines) moved entirely to spot, with diversified instance pools to reduce simultaneous-interruption risk. Tier 3 (analytics, log retention, replication) was the largest source of waste and the easiest to rework: most data was tiered to S3, lifecycle policies pruned cold data automatically, and batch jobs ran in scheduled windows on spot.
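The tier split can be written down as a small lookup table. The following is a minimal sketch, assuming the tier membership shown in the cost breakdown under Results; the workload labels, class names and the `interruptible` flag are illustrative and not taken from the original platform.

```python
from dataclasses import dataclass
from enum import Enum


class Purchasing(Enum):
    ON_DEMAND_RESERVED = "on-demand + reserved"
    SPOT = "spot"
    SCHEDULED_SPOT = "scheduled spot windows"


@dataclass(frozen=True)
class Tier:
    name: str
    purchasing: Purchasing
    interruptible: bool  # can the work tolerate a short-notice eviction?


TIERS = {
    1: Tier("latency-critical", Purchasing.ON_DEMAND_RESERVED, interruptible=False),
    2: Tier("variable batch", Purchasing.SPOT, interruptible=True),
    3: Tier("storage & analytics", Purchasing.SCHEDULED_SPOT, interruptible=True),
}

# Workload -> tier. The keys are hypothetical labels, not real service IDs.
WORKLOAD_TIER = {
    "rtb-bidder": 1,
    "ad-server": 1,
    "hadoop": 2,
    "druid": 2,
    "pipelines": 2,
    "log-retention": 3,
    "replication": 3,
}


def purchasing_for(workload: str) -> Purchasing:
    """Return the purchasing model for a workload; unknown names raise KeyError."""
    return TIERS[WORKLOAD_TIER[workload]].purchasing


def spot_eligible(workload: str) -> bool:
    """True when the workload's tier tolerates spot interruptions."""
    return TIERS[WORKLOAD_TIER[workload]].interruptible
```

Keeping the mapping explicit means any proposal to move a workload onto spot becomes a one-line change that can be reviewed against its latency budget.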
Implementation notes
The migration ran live behind a fronting load balancer
Each component migrated by spinning up the AWS-side replica, joining it to the production load balancer at low traffic weight, ramping the weight over a week of observation, then decommissioning the legacy node. Any AWS-side issue meant rolling back a load-balancer weight, not rolling back a deploy. The team could move fast because the rollback mechanism was instant.
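The ramp-and-rollback loop described above can be sketched as follows. This is an illustration only: `set_weight` and `healthy` are hypothetical callables standing in for whatever the load balancer and monitoring expose, and the seven-step default simply assumes one step per day of the observation week.

```python
from typing import Callable, List


def ramp_schedule(steps: int = 7, start: int = 1, full: int = 100) -> List[int]:
    """Return increasing traffic weights (percent), from `start` up to `full`."""
    if steps < 2:
        raise ValueError("need at least two steps")
    span = full - start
    return [start + round(span * i / (steps - 1)) for i in range(steps)]


def run_ramp(
    schedule: List[int],
    set_weight: Callable[[int], None],
    healthy: Callable[[], bool],
) -> int:
    """Apply each weight in turn and return the final weight.

    `healthy` is expected to cover the observation window for the current
    step. On the first unhealthy reading the AWS-side weight drops to 0,
    which sends all traffic back to the legacy stack.
    """
    for weight in schedule:
        set_weight(weight)
        if not healthy():
            set_weight(0)  # rollback is a weight change, not a redeploy
            return 0
    return schedule[-1]
```

The point of the design is that the rollback path is the same call as the roll-forward path, just with a different number.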
Spot interruption handling, properly
The biggest spot mistake people make is treating an interruption as an exception. Treat it as the normal case. Every Tier 2 worker received the spot interruption notice (the two-minute warning) as an event, gracefully drained its work, checkpointed to S3, and exited. The auto-scaling group brought up a replacement from a different instance pool. The job queue retried. No human involved.
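A worker loop that treats the interruption notice as a normal event might look like the sketch below. It assumes the notice is read by polling the EC2 instance metadata path `spot/instance-action` (using an IMDSv2 token), and that each job is short enough to finish inside the two-minute window. `next_job`, `process` and `checkpoint` are hypothetical hooks, not the original implementation.

```python
import json
import urllib.error
import urllib.request
from typing import Any, Callable, Optional

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout: float = 1.0) -> Optional[str]:
    """Return the raw interruption notice, or None if none is scheduled."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    try:
        with urllib.request.urlopen(token_req, timeout=timeout) as resp:
            token = resp.read().decode()
        req = urllib.request.Request(
            f"{IMDS}/meta-data/spot/instance-action",
            headers={"X-aws-ec2-metadata-token": token},
        )
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # 404 means no interruption is pending
            return None
        raise


def interruption_pending(body: Optional[str]) -> bool:
    """True when the metadata body announces a terminate, stop or hibernate."""
    if not body:
        return False
    return json.loads(body).get("action") in {"terminate", "stop", "hibernate"}


def work_loop(
    next_job: Callable[[], Optional[Any]],
    process: Callable[[Any], None],
    checkpoint: Callable[[], None],
    poll: Callable[[], Optional[str]] = fetch_instance_action,
) -> str:
    """Process jobs until the queue is empty or an interruption notice arrives."""
    while True:
        if interruption_pending(poll()):
            checkpoint()  # persist progress (S3 in the article) before exiting
            return "interrupted"
        job = next_job()
        if job is None:
            return "drained"
        process(job)
```

Because `poll` is injected, the drain path can be exercised in tests with a fake notice rather than waiting for a real interruption.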
For the Tier 2 batch fleet, spot diversification across instance families and AZs meant the probability of losing the whole fleet at once was effectively zero, even during spot price spikes.
The "DR" that nobody had tested
The legacy setup had cross-region replication configured for disaster recovery. Nobody had run a DR drill in years. We tested it as part of the migration audit and discovered the replication had been silently broken for ~18 months. That alone was a customer-facing risk that the migration surfaced and resolved — the new AWS architecture had multi-AZ resilience by default and a documented DR drill scheduled quarterly.
Monitoring 400 servers without enterprise APM cost
The legacy monitoring was a patchwork of vendor agents at substantial per-host pricing. The replacement was Nagios + Bash for host-level checks and Grafana on top of a self-hosted metrics backend for trends and dashboards, a deliberate choice to avoid recreating the per-host licence cost on AWS. The team owned the monitoring stack outright; cost scaled with infrastructure, not vendor pricing.
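A host-level check in this style is small. The original checks were written in Bash; the sketch below uses Python and assumes only the standard Nagios plugin convention of exit codes 0 to 3 plus a one-line status with optional performance data. The disk-usage check and its thresholds are illustrative, not the checks that actually ran.

```python
import shutil
from typing import Tuple

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(used_pct: float, warn: float = 80.0, crit: float = 90.0) -> Tuple[int, str]:
    """Map a disk-usage percentage to a Nagios exit code and status line."""
    perfdata = f"used_pct={used_pct:.1f}%;{warn:g};{crit:g}"
    if used_pct >= crit:
        return CRITICAL, f"DISK CRITICAL - {used_pct:.1f}% used | {perfdata}"
    if used_pct >= warn:
        return WARNING, f"DISK WARNING - {used_pct:.1f}% used | {perfdata}"
    return OK, f"DISK OK - {used_pct:.1f}% used | {perfdata}"


def main(path: str = "/") -> int:
    """Check one filesystem, print the status line, and return the exit code."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as err:
        print(f"DISK UNKNOWN - {err}")
        return UNKNOWN
    code, message = classify(100 * usage.used / usage.total)
    print(message)
    return code


# A plugin script would end with: raise SystemExit(main())
```

Nagios reads the exit code for alerting, and the text after the pipe can be forwarded to a metrics backend for graphing.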
What didn't work
- First attempt at putting RTB bidders on spot. Worked in dev. In production, spot price spikes correlated across instance families and we lost too much fleet at once. Reverted to on-demand for Tier 1 and stopped trying to be clever about it.
- Aggressive S3 lifecycle on log data. Initially set 30-day retention on tracker logs. Discovered the analytics team was running monthly cohort queries that needed 90 days. Pushed retention to 120 days to be safe; still saved 70% vs unlimited retention.
- MongoDB on spot. Tempting; bad idea. Running stateful databases on spot is a foot-gun unless you have very good replica management. Kept on on-demand.
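The log-retention lesson above is easy to encode as a check that runs before a lifecycle rule ships. The sketch below builds an S3 lifecycle configuration in the shape accepted by `put_bucket_lifecycle_configuration` and compares its expiration against the longest look-back each consumer needs. The bucket name, prefix, rule ID and consumer name are assumptions made for illustration.

```python
from typing import Dict


def tracker_log_lifecycle(days: int, prefix: str = "tracker-logs/") -> dict:
    """Build a lifecycle configuration that expires objects under `prefix`."""
    return {
        "Rules": [
            {
                "ID": "expire-tracker-logs",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


def retention_days(config: dict, prefix: str) -> int:
    """Shortest expiration among enabled rules that match the prefix exactly."""
    days = [
        rule["Expiration"]["Days"]
        for rule in config["Rules"]
        if rule["Status"] == "Enabled"
        and rule.get("Filter", {}).get("Prefix") == prefix
        and "Days" in rule.get("Expiration", {})
    ]
    if not days:
        raise LookupError(f"no expiration rule for {prefix!r}")
    return min(days)


def retention_gaps(config: dict, prefix: str, lookbacks: Dict[str, int]) -> Dict[str, int]:
    """Return the consumers whose look-back window exceeds the retention."""
    limit = retention_days(config, prefix)
    return {name: need for name, need in lookbacks.items() if need > limit}


# Applying the policy with boto3 (not executed here; bucket name is hypothetical):
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-tracker-logs",
#     LifecycleConfiguration=tracker_log_lifecycle(120),
# )
```

With the 90-day cohort queries registered as a consumer, the original 30-day rule would have been flagged before deployment, while the 120-day rule passes.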
Results
Cost breakdown — before and after
| Tier | Before (legacy) | After (AWS) | Saving |
|---|---|---|---|
| Latency-critical (RTB, ad servers) | ~$58K | ~$18K | 69% |
| Variable batch (Hadoop, Druid, pipelines) | ~$42K | ~$8K | 81% |
| Storage & analytics | ~$20K | ~$5K | 75% |
| Networking, monitoring, misc | ~$15K | ~$4K | 73% |
| Total monthly | $135K | $35K | 72% |
Figures rounded for clarity. Actual savings varied month-to-month within ±5% of these averages.
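The per-tier percentages can be re-derived from the rounded figures in the table. A minimal sketch, using only the numbers shown above (in thousands of dollars per month):

```python
# (before, after) in $K per month, copied from the rounded table figures.
COSTS_K = {
    "Latency-critical": (58, 18),
    "Variable batch": (42, 8),
    "Storage & analytics": (20, 5),
    "Networking, monitoring, misc": (15, 4),
}


def saving_pct(before: float, after: float) -> int:
    """Percentage saved, rounded to the nearest whole percent."""
    return round(100 * (before - after) / before)


per_tier = {tier: saving_pct(b, a) for tier, (b, a) in COSTS_K.items()}
total_before = sum(b for b, _ in COSTS_K.values())
total_after = sum(a for _, a in COSTS_K.values())
total_saving = saving_pct(total_before, total_after)
```

On these rounded inputs the tiers come out at 69%, 81%, 75% and 73%, and the totals of $135K and $35K give roughly 74%. The gap to the quoted 72% sits within the rounding and the ±5% month-to-month variation noted above.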
Beyond the headline number, three softer wins:
- Deployment time fell ~50%. The migration forced everything through Bash + cloud-init + auto-scaling-group bootstraps — repeatable, scriptable, fast. Deploys that took 30+ minutes manually now ran in under 15.
- Downtime dropped ~30%. Largely an architectural side-effect: multi-AZ by default, health-checked load balancers, and spot fleet replacement removed entire classes of outage that the legacy stack had.
- Operations team got their evenings back. No more weekend emergency hardware swaps. Auto-scaling and self-healing replaced what used to be human pages.
What I'd do differently today
This migration was a decade ago. Reaching the same outcome today would involve different tools, but the principles are unchanged. The current playbook would be:
- Terraform or CDK from day one, not Bash + cloud-init. Reproducibility costs less when written upfront than retrofitted.
- EKS for the Tier 2 batch workloads, with Karpenter for spot-aware node provisioning. Karpenter handles diversification and interruption better than 2014-era auto-scaling groups.
- Savings Plans + RIs as a base layer, spot on top, rather than just on-demand + spot. Compute Savings Plans flex across instance family and region; they're a better default than RIs for most teams now.
- FinOps observability earlier. Daily cost-by-tag dashboards from day one, not "we'll add it later." A 72% reduction is impressive but happened partly because waste had been accumulating invisibly for years; a real-time view would have caught it before it got that bad.
Tools used
AWS EC2 · ELB · S3 · Auto Scaling · RDS · Spot Fleet · Bash · Cloud-init · Hadoop · Druid · Kafka · MySQL · MongoDB · Nagios · Grafana · Linux.