Fault Tolerance


Isolation

-- auth service down: queries still execute

select * from orders where user_id = 123;

Redundancy

primary: us-east-1a | replica1: us-east-1b | replica2: us-east-1c

Static Stability

# config service fails, use cached configuration

last_known_good: { max_connections: 1000, timeout: 30s }
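
The fallback above can be sketched as a small cache that keeps serving the last configuration it fetched successfully. This is a minimal illustration, not a definitive implementation: `ConfigCache` and the `fetch` callable are hypothetical names, and the defaults simply mirror the `last_known_good` values shown above.

```python
from typing import Any, Callable, Dict

# Hypothetical defaults mirroring the last_known_good values above.
LAST_KNOWN_GOOD: Dict[str, Any] = {"max_connections": 1000, "timeout": "30s"}


class ConfigCache:
    """Serves the last successfully fetched config when a fetch fails."""

    def __init__(self, fetch: Callable[[], Dict[str, Any]], initial: Dict[str, Any]):
        self._fetch = fetch
        self._last_known_good = dict(initial)

    def get(self) -> Dict[str, Any]:
        try:
            fresh = self._fetch()
        except Exception:
            # Config service unreachable: fall back to the cached copy.
            return dict(self._last_known_good)
        # Fetch succeeded: remember it as the new last-known-good config.
        self._last_known_good = dict(fresh)
        return dict(fresh)
```

The point of the design is that a config service outage changes nothing on the serving path: callers of `get()` never see the failure.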

Architecture Patterns

Control Plane vs Data Plane

control plane api down -> database continues serving queries

Multi-Zone/Multi-Region

cluster topology:

primary(us-east) -> read_replica(eu-west), read_replica(ap-south)
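
One way to picture how traffic might use this topology: writes go to the primary, reads go to a replica in the caller's region when one exists. The `route` function and the `TOPOLOGY` mapping are assumptions made for illustration; the region names come from the topology above.

```python
from typing import Dict

# Region -> role, taken from the topology above.
TOPOLOGY: Dict[str, str] = {
    "us-east": "primary",
    "eu-west": "read_replica",
    "ap-south": "read_replica",
}


def route(client_region: str, is_write: bool) -> str:
    """Return the region that should serve this request."""
    primary = next(r for r, role in TOPOLOGY.items() if role == "primary")
    if is_write:
        # Only the primary accepts writes.
        return primary
    # Reads stay local when the client's region has a node; otherwise use the primary.
    return client_region if client_region in TOPOLOGY else primary
```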

Operational Practices

Continuous Failover Testing

# scheduled weekly failover

mysql> set global read_only=1; -- run on the current primary: demote it
mysql> set global read_only=0; -- run on the chosen replica: promote it
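
The ordering in the drill matters: the old primary is made read-only before the replica is made writable, so two writable nodes never exist at once. The sketch below simulates that ordering in memory. `Node`, `failover`, and `writable` are hypothetical names, and no real database is contacted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    read_only: bool


def failover(primary: Node, replica: Node) -> Node:
    """Demote first, then promote, so two writable nodes never coexist."""
    primary.read_only = True    # mirrors: set global read_only=1
    replica.read_only = False   # mirrors: set global read_only=0
    return replica


def writable(nodes: List[Node]) -> List[str]:
    """Names of nodes currently accepting writes; a healthy cluster has exactly one."""
    return [n.name for n in nodes if not n.read_only]
```

A scheduled drill could assert after each run that `writable(...)` contains exactly one name.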

Progressive Delivery

{
  "feature_new_replication": {
    "dev": true,
    "staging": true,
    "prod": false
  }
}
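
A flag file like the one above could be consulted at runtime with a small lookup. This is a sketch: `is_enabled` is a hypothetical helper, and defaulting unknown features or environments to off is an assumption, not something the source specifies.

```python
import json
from typing import Any, Dict

FLAGS_JSON = """
{
  "feature_new_replication": {
    "dev": true,
    "staging": true,
    "prod": false
  }
}
"""


def is_enabled(flags: Dict[str, Any], feature: str, env: str) -> bool:
    """Unknown features or environments default to off (fail closed)."""
    return bool(flags.get(feature, {}).get(env, False))


flags = json.loads(FLAGS_JSON)
```

Flipping `prod` to `true` only after the feature has run in `dev` and `staging` is what makes the rollout progressive.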

Synchronous Replication

-- mysql semi-sync: require minimum 1 replica acknowledgment

set global rpl_semi_sync_master_wait_for_slave_count = 1;
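
The rule this setting expresses can be sketched as a simple predicate: a commit returns to the client only once the required number of replicas have acknowledged it. `commit_acknowledged` is a hypothetical function used for illustration. It does not model MySQL's timeout behavior, where the source falls back to asynchronous replication if no acknowledgment arrives within `rpl_semi_sync_master_timeout`.

```python
from typing import Iterable


def commit_acknowledged(replica_acks: Iterable[bool], wait_for_count: int = 1) -> bool:
    """True once at least wait_for_count replicas have acknowledged the transaction."""
    return sum(1 for ack in replica_acks if ack) >= wait_for_count
```

With `wait_for_count = 1`, losing the primary right after a commit still leaves at least one replica that received the transaction.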

Failure Handling

Instance Level

# ebs volume migration

aws ec2 detach-volume --volume-id vol-dead
aws ec2 attach-volume --volume-id vol-dead --instance-id i-healthy --device /dev/sdf  # --device is required; /dev/sdf is an assumed example

Zone Level

ProxySQL: us-east-1a (failed) -> us-east-1b (new primary)

application connection strings remain unchanged
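
The reason connection strings stay unchanged is that applications talk to the proxy, and only the proxy's choice of backend changes. A minimal sketch of that choice, assuming a fixed zone priority order and a health map supplied by some monitor (`pick_primary` and both arguments are hypothetical, not ProxySQL's actual API):

```python
from typing import Dict, List


def pick_primary(zones: List[str], healthy: Dict[str, bool]) -> str:
    """Return the first healthy zone in priority order."""
    for zone in zones:
        # Zones missing from the health map are treated as unhealthy.
        if healthy.get(zone, False):
            return zone
    raise RuntimeError("no healthy zone available")
```

For example, with zones listed as `us-east-1a`, `us-east-1b`, `us-east-1c` and `us-east-1a` marked unhealthy, the function selects `us-east-1b`, matching the failover shown above.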

Region Level

-- promote eu-west read replica to primary (MySQL 8.0.22+ statements, run on the replica)

stop replica; reset replica all; set global read_only=0;