Cohabit runs on an Amazon AWS cluster spread across three separate Availability Zones (AZs) in AP-Southeast-2 (Sydney). We use a number of different AWS services, each of which has slightly different retention and redundancy characteristics.
Retention - how long backups are kept. If there was an issue, how far back could we go when restoring data?
Redundancy - how many copies of a component exist. A single instance has poor redundancy; spreading a workload over multiple instances gives better redundancy, since one can be lost while the others keep running.
There are two areas we need to consider for disaster planning. The first is the infrastructure itself: the server instances, databases, object stores and so on. Our databases have a three-day backup retention policy, and they are all multi-AZ, which means that even if one AZ went down there would be no impact.
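As an illustration only, a short boto3 sketch like the one below could be used to confirm that each database instance is multi-AZ and meets the retention policy described above. The three-day threshold mirrors that policy; the script is not a copy of our actual tooling.

```python
# Sketch: audit RDS instances for multi-AZ and backup retention.
# Assumes AWS credentials are already configured; illustrative only.
import boto3

rds = boto3.client("rds", region_name="ap-southeast-2")

def audit_databases(min_retention_days: int = 3) -> None:
    for db in rds.describe_db_instances()["DBInstances"]:
        name = db["DBInstanceIdentifier"]
        if not db["MultiAZ"]:
            print(f"WARNING: {name} is not multi-AZ")
        if db["BackupRetentionPeriod"] < min_retention_days:
            print(f"WARNING: {name} retains backups for only "
                  f"{db['BackupRetentionPeriod']} day(s)")

if __name__ == "__main__":
    audit_databases()
```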
The servers themselves are “ephemeral”, and we deliberately don’t back them up: each time we deploy a new version we destroy all the existing servers and replace them with new, updated images (using Docker containers). This means that if we did unexpectedly lose some servers, the system would automatically create new ones in a working AZ. In fact this happens all the time as the platform adjusts to load and rolls out code updates.
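To make the “ephemeral servers” point concrete, here is a hedged sketch using the official Kubernetes Python client to check that a deployment’s replicas are healthy and spread across more than one zone. The deployment name, namespace and label selector are hypothetical placeholders, not our real workload names.

```python
# Sketch: confirm a deployment is fully available and its pods are
# spread across availability zones. Names below are placeholders.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()
core = client.CoreV1Api()

def check_spread(name: str = "web", namespace: str = "default") -> None:
    dep = apps.read_namespaced_deployment(name, namespace)
    ready = dep.status.ready_replicas or 0
    print(f"{name}: {ready}/{dep.spec.replicas} replicas ready")

    zones = set()
    pods = core.list_namespaced_pod(namespace, label_selector=f"app={name}")
    for pod in pods.items:
        if pod.spec.node_name:
            node = core.read_node(pod.spec.node_name)
            zones.add(node.metadata.labels.get("topology.kubernetes.io/zone"))
    print(f"{name}: pods running in zones {sorted(z for z in zones if z)}")

if __name__ == "__main__":
    check_spread()
```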
The second is the software that runs on this infrastructure. The code itself is stored in a versioned code repository (GitHub); if we were to lose the code from our servers, the right version could be quickly restored from GitHub. The software that runs our infrastructure (Kubernetes, etc.) is driven by configuration files, which also live on GitHub and can likewise be rapidly restored.
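As a rough sketch of what “rapidly restored” means in practice: clone the tagged release from GitHub and re-apply the infrastructure manifests. The repository URL, tag and manifest path below are hypothetical.

```python
# Sketch: restore a known-good version of the code and cluster config
# from GitHub and re-apply it. All names and paths are placeholders.
import subprocess

REPO = "git@github.com:example/cohabit-infra.git"  # hypothetical
TAG = "v1.2.3"                                     # hypothetical

def restore(dest: str = "/tmp/restore") -> None:
    # Fetch only the tagged release we want to restore.
    subprocess.run(
        ["git", "clone", "--branch", TAG, "--depth", "1", REPO, dest],
        check=True,
    )
    # Re-apply the Kubernetes manifests stored in the repository.
    subprocess.run(["kubectl", "apply", "-f", f"{dest}/k8s/"], check=True)

if __name__ == "__main__":
    restore()
```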
We also depend on certain third-party services, such as payment processing by Stripe. We’ve designed our platform not to depend on these being continuously available, which means that if there were a disruption to payment processing, no data would be lost. The platform simply waits and continues processing once the payment processor comes back online.
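The “wait and continue” behaviour amounts to a queue-and-retry pattern. The sketch below is illustrative only: charge_customer is a hypothetical stand-in for the real Stripe integration, and the in-memory deque stands in for whatever durable queue the platform actually uses.

```python
# Sketch: queue payments and retry when the processor is unavailable.
# charge_customer and the in-memory deque are illustrative placeholders.
import time
from collections import deque

pending: deque = deque()

def charge_customer(payment: dict) -> None:
    """Hypothetical call into the payment processor (e.g. Stripe)."""
    raise NotImplementedError

def process_payments(retry_delay_s: int = 60) -> None:
    while pending:
        payment = pending.popleft()
        try:
            charge_customer(payment)
        except Exception:
            # Processor unavailable: put the payment back and wait.
            # No data is lost; we simply try again later.
            pending.appendleft(payment)
            time.sleep(retry_delay_s)
```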
Disaster scenarios:
Our offices become unavailable / unusable
Continuity Impact: none
Recovery process: engineers work remotely while premises and computers are replaced
Data loss: none
Payment processor unavailable
Continuity Impact: payment flow limited for the duration of the outage
Recovery process: payments queue and automatically resume once the processor recovers
Data loss: none
AWS single zone outage
Continuity Impact: none
Recovery process: assets in the remaining zones autoscale up to take on the load.
Data loss: none
AWS total region outage
Continuity Impact: severe
Recovery process: rebuild the cluster in an alternate region, e.g. AP-Southeast-1 (Singapore); see the snapshot-copy sketch after this list
Data loss: data restored from the last snapshot (at most 24 hours lost)
AWS total outage
Continuity Impact: catastrophic
Recovery process: rebuild the cluster with an alternate vendor (e.g. Google Cloud, Azure)
Data loss: data restored from the last snapshot (at most 24 hours lost); a total AWS outage would also impact many other services we rely on, such as GitHub and Assembly
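For the two region-level scenarios above, the “restored from the last snapshot” step could look something like the boto3 sketch below: copy the most recent snapshot into the alternate region and restore a multi-AZ instance from it. The snapshot ARN and instance identifier are hypothetical, and this is a sketch of the idea rather than our actual runbook.

```python
# Sketch: copy the latest snapshot to an alternate region and restore
# a multi-AZ instance from it. Identifiers are hypothetical placeholders.
import boto3

SOURCE_REGION = "ap-southeast-2"   # Sydney (primary)
TARGET_REGION = "ap-southeast-1"   # Singapore (fallback)

def failover_database(snapshot_arn: str, new_instance_id: str) -> None:
    rds = boto3.client("rds", region_name=TARGET_REGION)
    copy = rds.copy_db_snapshot(
        SourceDBSnapshotIdentifier=snapshot_arn,
        TargetDBSnapshotIdentifier=f"{new_instance_id}-restore",
        SourceRegion=SOURCE_REGION,
    )
    snapshot_id = copy["DBSnapshot"]["DBSnapshotIdentifier"]
    rds.get_waiter("db_snapshot_available").wait(DBSnapshotIdentifier=snapshot_id)
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier=new_instance_id,
        DBSnapshotIdentifier=snapshot_id,
        MultiAZ=True,
    )
```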