FinTech · DevOps & Security

Zero-Downtime
Cloud Migration

How we migrated 14 on-premises services to Azure with zero downtime during peak trading hours — achieving 99.99% uptime and reducing infrastructure costs by 42%.

0sDowntime

99.99%Uptime SLA

42%Cost Reduction

14Services Migrated

Overview

Client Context

A New York–based financial technology firm processed 2.3 million transactions daily through 14 interconnected services running on aging on-premises hardware. Their co-location contract was expiring in 6 months, the hardware was EOL, and the vendor quoted a 65% price increase for renewal.

The challenge: the platform could not tolerate even 30 seconds of downtime. Trading operations ran 22 hours a day (closed only during the 2-hour nightly settlement window). Any service interruption during market hours would trigger regulatory reporting obligations and potential fines exceeding $500K per incident.

They needed a complete migration to Azure — infrastructure, databases, message queues, and monitoring — without a single second of user-facing downtime.

Azure Kubernetes Service Terraform GitHub Actions Azure SQL Redis Enterprise Azure Service Bus Datadog HashiCorp Vault

The Challenge

Migration Without Interruption

Traditional "lift and shift" migration was impossible. The services were tightly coupled, the databases used features specific to SQL Server on-prem, and the deployment process was manual — SSH into servers and run scripts.

⏰ Zero Downtime Mandate

22 hours/day of active trading. The 2-hour settlement window wasn't long enough for a big-bang cutover. Migration had to be incremental, service-by-service, with instant rollback capability.

🔗 Tight Coupling

14 services communicated via direct TCP connections, shared database tables, and in-memory caches. Extracting one service required understanding and rerouting all its dependencies.

🔐 Regulatory Compliance

SOC 2 Type II, PCI DSS, and SEC audit requirements meant every infrastructure change needed documented approval, encrypted data in transit and at rest, and complete audit trails.

📊 No CI/CD Pipeline

Deployments were manual SSH sessions with handwritten runbooks. No infrastructure-as-code, no automated testing, no rollback mechanisms. A single typo in a config file had caused a 4-hour outage the previous quarter.

The Solution

Strangler Fig Architecture

We implemented the Strangler Fig pattern — gradually routing traffic from on-prem services to Azure equivalents, one service at a time. An API gateway controlled traffic splitting, allowing 1%, 10%, 50%, then 100% cutover with instant rollback at every stage.

On-Premises (Before)

Windows Server 2016 (3×)

SQL Server 2017 Cluster

RabbitMQ (single node)

Redis 5 (no cluster)

Manual deployments (SSH)

Nagios monitoring

Self-managed TLS certs

→

Azure (After)

AKS (3 node pools, auto-scale)

Azure SQL Hyperscale (geo-replicated)

Azure Service Bus Premium

Redis Enterprise (clustered)

GitHub Actions + ArgoCD

Datadog + PagerDuty

Azure Key Vault + Managed Identities

01 — Terraform Everything

All Azure infrastructure defined in Terraform with remote state in Azure Blob Storage. Every change reviewed via PR, planned automatically, and applied through GitHub Actions. No manual infrastructure changes ever.

02 — AKS with GitOps

ArgoCD manages all Kubernetes deployments declaratively. Blue-green deployments with automated canary analysis. Failed health checks automatically roll back before traffic reaches the new version.

03 — Database Migration

Azure Database Migration Service for initial sync, then continuous replication. Cutover during settlement windows with automated data integrity checks — row counts, checksums, and sample query validation.

04 — Security Hardening

HashiCorp Vault for secrets management. Azure Private Endpoints eliminate public internet exposure. Network Security Groups and Azure WAF provide defense-in-depth. All communication mTLS encrypted.

Technical Deep-Dive

Terraform + GitHub Actions Pipeline

Every infrastructure change follows the same path: PR → automated plan → peer review → merge → automated apply. The pipeline includes security scanning, cost estimation, and drift detection.

main.tf — AKS Cluster Definition

resource "azurerm_kubernetes_cluster" "trading" { name = "aks-trading-prod" location = azurerm_resource_group.main.location resource_group_name = azurerm_resource_group.main.name dns_prefix = "trading" kubernetes_version = "1.28" sku_tier = "Standard" # Uptime SLA default_node_pool { name = "system" vm_size = "Standard_D4s_v5" min_count = 3 max_count = 9 enable_auto_scaling = true zones = [1, 2, 3] # AZ redundancy os_disk_type = "Ephemeral" } identity { type = "SystemAssigned" # No stored creds } network_profile { network_plugin = "azure" network_policy = "calico" # Pod-level firewalls load_balancer_sku = "standard" outbound_type = "userDefinedRouting" # → Azure Firewall } oms_agent { log_analytics_workspace_id = azurerm_log_analytics_workspace.main.id } key_vault_secrets_provider { secret_rotation_enabled = true secret_rotation_interval = "2m" } lifecycle { prevent_destroy = true # Safety net } }

The AKS cluster spans 3 availability zones for hardware-level redundancy. Ephemeral OS disks improve node startup time from 8 minutes to 90 seconds. Calico network policies enforce zero-trust pod communication — services can only reach their declared dependencies.

Results

Measurable Impact

The migration was completed over 14 weeks, with each service migrated during settlement windows. Total user-facing downtime: zero seconds.

Metric	Before	After	Improvement
User-Facing Downtime	12 hrs/quarter (avg)	0 seconds	▲ 99.99% uptime
Infrastructure Cost	$68K/month	$39K/month	▲ 42% reduction
Deploy Frequency	Monthly (manual)	12×/day (automated)	▲ 360× faster
Mean Time to Recovery	4.2 hours	3.5 minutes	▲ 98.6% faster
Security Posture	3 open audit findings	SOC 2 Type II certified	▲ Full compliance
Auto-Scaling	Manual VM provisioning	0→100 in 90 seconds	▲ Fully elastic

Client Feedback

Our board was terrified of this migration. We process billions in daily transactions and any downtime means regulatory scrutiny. Garnet Grid's strangler fig approach let us migrate one service at a time with instant rollback — we could pull the plug in 30 seconds if anything looked wrong. It never did. We went from dreading deployments to shipping 12 times a day with zero anxiety.

Chief Technology Officer

FinTech Platform — $2.3M Daily Transactions

Related Work

More Case Studies

Planning a Cloud Migration?

Let's architect a migration path that doesn't risk your uptime.

Start Your Project →

Zero-DowntimeCloud Migration