Disaster recovery (DR) is about preparing for and recovering from a disaster. Any event that has a negative impact on a company’s business continuity or finances could be termed as a disaster. This includes hardware or software failure, a network outage, a power outage, physical damage to a building like fire or flooding, human error or some other significant event.
To minimize the impact of a disaster, companies invest time and resources to plan and prepare, to train employees, and to document and update processes. The amount of investment for DR planning for a particular system can vary depending on the cost of a potential outage.
Two common industry terms for disaster planning:
- Recovery time objective (RTO): According to AWS, RTO is the time it takes after a disruption to restore a business process to its service level, as defined by the operational level agreement (OLA). For example, if a disaster occurs at 12:00 PM (noon) and the RTO is eight hours, the DR process should restore the business process to the acceptable service level by 8:00 PM.
- Recovery point objective (RPO): According to AWS, RPO is the acceptable amount of data loss measured in time. For example, if a disaster occurs at 12:00 PM (noon) and the RPO is one hour, the system should recover all data that was in the system before 11:00 AM. Data loss will span at most one hour, between 11:00 AM and 12:00 PM (noon).
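The arithmetic behind these two examples can be sketched in a few lines of Python. This is only an illustration: the function name `recovery_targets` and the calendar date are assumptions made for the example, not part of any AWS tooling.

```python
from datetime import datetime, timedelta


def recovery_targets(disaster_at, rto, rpo):
    """Return (restore_deadline, recovery_point) for a given disaster time.

    restore_deadline: latest time the business process should be back (RTO).
    recovery_point:   data written before this time should be recoverable (RPO).
    """
    restore_deadline = disaster_at + rto
    recovery_point = disaster_at - rpo
    return restore_deadline, recovery_point


# Hypothetical date; only the time of day (noon) comes from the examples above.
disaster = datetime(2024, 1, 1, 12, 0)
deadline, point = recovery_targets(
    disaster, rto=timedelta(hours=8), rpo=timedelta(hours=1)
)

print(deadline.time())  # 20:00:00, i.e. 8:00 PM
print(point.time())     # 11:00:00, i.e. 11:00 AM
```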
Traditional disaster recovery practices
1. Suitable capacity to scale the environment.
In a traditional environment, the best practice for recovering from a disaster is to buy enough servers to run all mission-critical services and to keep them in sync with each other. This includes storage appliances for the supporting data, and servers to run applications and backend services such as user authentication, Domain Name System (DNS), Dynamic Host Configuration Protocol (DHCP), monitoring and alerting.
2. Support for repairing, replacing and refreshing the infrastructure.
To run a large number of servers at the same time, the user needs ongoing support to repair, replace and refresh them.
3. Network infrastructure such as firewalls, routers, switches and load balancers.
Disaster recovery is not complete without network infrastructure. In a traditional environment, the user has to buy and configure network appliances to match the production environment.
Disaster recovery with AWS
1. Backup and Restore
In most traditional environments, data is backed up to disks and sent off-site regularly. Of the approaches described here, this one has the longest recovery time. Amazon S3 is well suited to backing up data, as it is designed to provide 99.999999999% durability of objects over a given year. Data is typically transferred to and from Amazon S3 over the network, so it is accessible from any location.
The AWS Storage Gateway service enables snapshots of the customer's on-premises data volumes to be transparently copied into Amazon S3 for backup. Customers can subsequently create local volumes or Amazon EBS volumes from these snapshots.
Backing up the customer's data is only half of the story. Recovery of data in a disaster scenario needs to be tested and achieved quickly. Customers should ensure that their systems are configured for appropriate data retention and data security, and that their data recovery processes have been tested.
Key steps for backup and restore:
- Select an appropriate tool or method to back up your data into AWS.
- Ensure that you have an appropriate retention policy for this data.
- Ensure that appropriate security measures are in place for this data, including encryption and access policies.
- Regularly test the recovery of this data and restoration of your system.
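As an illustration of the retention step, the sketch below picks out backups that have aged past a retention window. The function name `expired_backups`, the 30-day window and the dates are all hypothetical. In practice a policy like this could also be enforced on the service side, for example with Amazon S3 lifecycle rules.

```python
from datetime import date, timedelta


def expired_backups(backup_dates, today, retention_days):
    """Return backup dates older than the retention window, oldest first."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in backup_dates if d < cutoff)


# Hypothetical backup history and retention period, for illustration only.
backups = [date(2024, 1, 1), date(2024, 1, 20), date(2024, 2, 10)]
to_delete = expired_backups(backups, today=date(2024, 2, 15), retention_days=30)

# Only the 1 January backup falls before the 16 January cutoff.
print(to_delete)
```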
2. Pilot Light for Simple Recovery into AWS
The second potential approach is Pilot Light, where the data is mirrored, the environment is scripted as a template and a minimal version of the system is always running in a different region. The core elements of the system are:
- Database: It is always running, so that data is continuously replicated and available to the other layers.
- AMI: These are created and updated periodically.
Figure 2: The Pilot Light approach
The Pilot Light approach reduces the RTO and RPO, and recovery is largely a matter of turning on resources that are already defined. AWS CloudFormation can be used to automate the provisioning of services.
In the case of a disaster, the environment can be built out and scaled around the pilot light using the backed-up Amazon Machine Images (AMIs).
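To make "scripted as a template" more concrete, here is a minimal sketch of what a CloudFormation template for the recovery step might contain, built as a Python dictionary and printed as JSON. The logical names `RecoveryAmiId` and `AppServer` and the default instance type are assumptions for illustration. A real template would also describe networking, security groups and the other layers of the system.

```python
import json


def pilot_light_template():
    """Build a minimal CloudFormation template that launches one server from an AMI."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Sketch: application server launched from a pre-built AMI",
        "Parameters": {
            # Hypothetical parameter: the AMI that is refreshed periodically.
            "RecoveryAmiId": {"Type": "AWS::EC2::Image::Id"},
            # Hypothetical default; size it to match production in practice.
            "InstanceType": {"Type": "String", "Default": "t3.micro"},
        },
        "Resources": {
            "AppServer": {
                "Type": "AWS::EC2::Instance",
                "Properties": {
                    "ImageId": {"Ref": "RecoveryAmiId"},
                    "InstanceType": {"Ref": "InstanceType"},
                },
            }
        },
    }


template = pilot_light_template()
print(json.dumps(template, indent=2))
```

Keeping the AMI ID as a parameter means the same template can be reused each time a newer image is created.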
3. Warm Standby Solution in AWS
The next level up from the Pilot Light approach is the Warm Standby. It reduces the recovery time further by always running a scaled-down version of a fully functional environment. During the recovery phase, if the production system fails, the standby infrastructure is scaled up to match the production environment and DNS records are updated to route all traffic to the new AWS environment.
Essentially, a smaller version of the customer's full production environment is running at all times. This approach reduces RTO and RPO but incurs a higher cost, as services are running 24/7.
4. Multi-Site Solution deployed on AWS and On-Site
As the optimum technique in backup and disaster recovery, Multi-Site duplicates the environment, so there is always another environment serving live traffic in a different region, in an active-active configuration.
Figure 4: Multi site solution on AWS
Amazon Route 53 is used to route production traffic to the different sites. A proportion of traffic will go to your infrastructure in AWS, and the remainder will go to your on-site infrastructure.
Many options and variations for DR exist. This blog highlights some of the common scenarios, ranging from the simple backup-and-restore solution to fault-tolerant, multi-site solutions. AWS gives you fine-grained control and many building blocks to build the appropriate DR solution. AWS services are available on demand and customers pay only for what they use.
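With Route 53 weighted routing, each record receives traffic in proportion to its weight divided by the sum of all weights. The sketch below shows that calculation only and does not call AWS. The function name `traffic_shares`, the site names and the 30/70 split are illustrative assumptions.

```python
def traffic_shares(weights):
    """Map each site to its expected share of traffic: weight / sum of weights."""
    total = sum(weights.values())
    if total == 0:
        # Simplification: this sketch rejects the all-zero case outright.
        raise ValueError("at least one site needs a non-zero weight")
    return {site: weight / total for site, weight in weights.items()}


# Hypothetical weights: 30% of traffic to AWS, 70% to the on-site environment.
shares = traffic_shares({"aws": 30, "on-site": 70})
print(shares)  # {'aws': 0.3, 'on-site': 0.7}
```

During a failover, shifting all traffic to AWS would correspond to setting the on-site weight to zero.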