Business Continuity Management (BCM)
Business Continuity Planning
DoubleGDP’s Business Continuity Planning (BCP) encompasses the system of protection, prevention, and recovery of our services and data in the event of a disaster or unforeseen circumstances. We continually add to and improve this plan as our platform grows and we onboard new clients.
Disaster Recovery Plan
Our servers and databases are hosted on Amazon Web Services (AWS) and Heroku, and we follow their standard disaster recovery plan.
- Heroku’s plan for Disaster Recovery: The Heroku platform is designed for stability and scaling, and inherently mitigates common issues that lead to outages while maintaining recovery capabilities. The platform maintains redundancy to prevent single points of failure, can replace failed components, and uses multiple data centers designed for resiliency. In the case of an outage, the platform is deployed across multiple data centers using current system images, and data is restored from backups. Heroku reviews platform issues to understand the root cause and the impact on customers, and to improve the platform and processes.
In the event of a complete loss of AWS and Heroku servers and databases, DoubleGDP will provision production servers at another cloud provider. This step could take up to 4 days. However, this scenario would arise only from a catastrophic failure that effectively takes down all of AWS across multiple data centers.
Data Backup Policy
Our databases are continuously backed up using Heroku Postgres. Heroku Postgres uses physical backups for continuous protection by persisting incremental snapshots, or base backups, of the file system, along with write-ahead log (WAL) files, to external, reliable storage. With this approach, in case of a hardware failure, we may lose up to 2 minutes’ worth of transactions.
Our backups are replicated across multiple data centers on AWS. In case of a disaster, the database instance will have to be recreated and the data loaded from our current backups.
We are considering adding remote site backups, as well as hot backups, to the platform in the near future. We aim to have some or all of these expanded backup capabilities, including more frequent backups of critical services, by 2021-11-01.
Data Recovery Time
Generally, the Heroku platform will automatically restore the DoubleGDP application and DoubleGDP’s Postgres databases in the case of an outage. Our configuration relies on Heroku’s general setup to dynamically deploy our application within the Heroku cloud, monitor for failures, and recover failed platform components, including customer applications and databases.
In cases where recovery is required outside of automatic restore, DoubleGDP will allocate time after an incident to investigate the cause of the downtime and the extent of the data loss, if any. DoubleGDP’s service agreement on data recovery starts after DoubleGDP has accurately identified the cause of the downtime. Our current database recovery time from backup is 30 minutes, but as our database expands, we expect the recovery time from backup to increase significantly. However, we aim to expand our recovery and data backup plan for critical data by 2021-11-01, including hot backups, which will help maintain recovery time at acceptable levels. For more information on our data backup plan, refer to our Data Backup Policy.
Data Recovery Point
Heroku’s recovery point approach stores a rolling 4 days’ worth of data, which enables us to recover data from any point in time within that window. Beyond this, we also leverage Heroku to take and keep snapshots of most of our databases. Each snapshot is taken while the database is fully available and is a verbatim copy of the instance’s disk. This includes dead tuples, bloat, indexes, and all structural characteristics of the currently running database. The rate at which Heroku captures snapshots is dynamic. For average or low-change databases, Heroku captures a snapshot at least every 24 hours. For databases that change more frequently, Heroku captures snapshots more often than every 24 hours. In addition to the snapshots managed by Heroku, we also generate daily backup files at 11 PM UTC, which are stored in Heroku’s AWS S3 environment.
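As an illustration of the two recovery options described above, the following minimal Python sketch checks whether a desired recovery point still falls inside the rolling 4-day window and, if not, finds the most recent 11 PM UTC daily backup before an incident. The function and constant names are hypothetical and are not part of any Heroku tooling; only the 4-day window and the 11 PM UTC schedule come from this policy.

```python
from datetime import datetime, time, timedelta, timezone

# Values taken from this policy; the names themselves are illustrative only.
ROLLBACK_WINDOW = timedelta(days=4)
DAILY_BACKUP_TIME = time(23, 0, tzinfo=timezone.utc)  # 11 PM UTC


def _require_aware(moment: datetime) -> None:
    """Reject naive datetimes so comparisons are never ambiguous."""
    if moment.tzinfo is None:
        raise ValueError("use timezone-aware datetimes")


def in_rollback_window(target: datetime, now: datetime) -> bool:
    """Return True if `target` lies within the rolling window ending at `now`."""
    _require_aware(target)
    _require_aware(now)
    return now - ROLLBACK_WINDOW <= target <= now


def latest_daily_backup(incident: datetime) -> datetime:
    """Return the most recent 11 PM UTC backup time at or before `incident`."""
    _require_aware(incident)
    incident_utc = incident.astimezone(timezone.utc)
    candidate = datetime.combine(incident_utc.date(), DAILY_BACKUP_TIME)
    if candidate > incident_utc:
        # Today's backup has not run yet, so fall back to yesterday's.
        candidate -= timedelta(days=1)
    return candidate


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    wanted = now - timedelta(days=2, hours=3)
    if in_rollback_window(wanted, now):
        print("Point-in-time recovery is possible for", wanted.isoformat())
    else:
        print("Fall back to daily backup from", latest_daily_backup(wanted).isoformat())
```

The sketch only models the timing rules; the actual restore would still be performed through Heroku's own tooling.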
Service Monitoring and Communication
We use Rollbar error reporting to monitor our platform and are notified through Slack and PagerDuty when anomalies happen. Our response time is generally 20 minutes from the time an error is reported to the time it is triaged and analyzed. The errors are then prioritized and resolved based on their impact on business continuity.
Because our engineering team is distributed worldwide, we have 24-hour coverage to receive alerts from Rollbar and begin to diagnose the root cause and recovery actions.
We actively monitor and regularly communicate any system outages, scheduled downtimes, or platform issues that impact our clients’ business operations. For any support issues, our clients can email us at [email protected] or directly contact their Customer Success Manager for assistance.
Business Continuity Testing and Validation
We routinely restore backups of our databases to validate (i) the integrity of our backup strategy and (ii) the backup recovery time. We also constantly measure the deployment time of the application and factor that time into our recovery process.
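A restore drill of this kind could be timed with a small harness like the Python sketch below. It is an assumption-laden illustration, not our actual tooling: `run_restore_drill`, `DrillResult`, and the `restore` callable are hypothetical names, and the 30-minute default simply mirrors the current recovery time stated under Data Recovery Time.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class DrillResult:
    restored_ok: bool       # did the restore report success (integrity check)?
    elapsed_seconds: float  # measured restore duration
    within_target: bool     # success AND duration at or under the target


def run_restore_drill(
    restore: Callable[[], bool],
    target_seconds: float = 30 * 60,  # assumed target: the 30 minutes cited above
    clock: Callable[[], float] = time.monotonic,
) -> DrillResult:
    """Time a restore and compare the duration against the target.

    `restore` is a hypothetical callable that performs the restore into a
    scratch database and returns True if the restored data passes validation.
    """
    start = clock()
    ok = restore()
    elapsed = clock() - start
    return DrillResult(ok, elapsed, ok and elapsed <= target_seconds)


if __name__ == "__main__":
    # Stand-in restore step for demonstration; a real drill would call the
    # provider's restore tooling and run integrity checks here.
    result = run_restore_drill(lambda: True)
    print(result)
```

Passing the clock in as a parameter keeps the harness testable without waiting for a real restore to finish.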