Implementing Disaster Recovery (DR) Strategies with Atsky: A Consultative and Strategic Approach

As the chief solutions architect for Atsky's Robustness Framework, I help clients build resilient workloads on AWS and Google cloud infrastructure. Part of that work is preparing them for disaster events, one of the most formidable challenges they might encounter. Such events can stem from natural disasters like earthquakes or floods, technical failures such as power or network loss, and human actions like unintentional or unauthorised modifications. In essence, any event that prevents a workload or system from fulfilling its business objectives in its primary location is regarded as a disaster. This blog post offers insights on architecting for disaster recovery (DR), the process of preparing for and recovering from a disaster. DR is an essential component of your Business Continuity Plan.

Setting DR Objectives

Because a disaster event can disrupt your workload, your DR objective should be to restore the workload promptly or to avoid downtime entirely. We use the following metrics:

Recovery Time Objective (RTO): This determines the maximum tolerable duration between service interruption and restoration. It dictates an acceptable length of service downtime.
Recovery Point Objective (RPO): This represents the longest allowable period since the last data recovery point. It dictates how much data loss, measured from the last recovery point to the service interruption, is considered acceptable.

Minimising the RTO and RPO means less downtime and data loss. However, lower RTO and RPO values entail more resource spend and operational complexity. Therefore, it's crucial to set RTO and RPO values that offer optimal value for your workload.
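To make the two metrics concrete, here is a minimal Python sketch that checks a single incident against an RTO and an RPO. The function name `check_objectives` and the timestamps are hypothetical and chosen only for illustration; they are not part of any Atsky tool.

```python
from datetime import datetime, timedelta


def check_objectives(outage_start, service_restored, last_recovery_point, rto, rpo):
    """Compare one incident against the RTO and RPO (illustrative helper)."""
    # Downtime: from service interruption to restoration (compared with RTO).
    downtime = service_restored - outage_start
    # Data loss window: from the last recovery point to the interruption
    # (compared with RPO).
    data_loss_window = outage_start - last_recovery_point
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= rto,
        "rpo_met": data_loss_window <= rpo,
    }


if __name__ == "__main__":
    # Hypothetical incident: last recovery point at 11:30, outage at 12:00,
    # service restored at 13:30, with a one-hour RTO and a one-hour RPO.
    report = check_objectives(
        outage_start=datetime(2024, 1, 1, 12, 0),
        service_restored=datetime(2024, 1, 1, 13, 30),
        last_recovery_point=datetime(2024, 1, 1, 11, 30),
        rto=timedelta(hours=1),
        rpo=timedelta(hours=1),
    )
    print(report["rto_met"], report["rpo_met"])
```

In this example the 90 minutes of downtime exceed the one-hour RTO, while the 30-minute data loss window stays within the one-hour RPO.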

Scope of Impact for a Disaster Event

Multi-AZ Strategy: Each Atsky Region encompasses multiple Availability Zones (AZs). Each AZ consists of one or more data centers and is situated in a separate and distinct geographic location. This setup significantly reduces the risk of a single event affecting multiple AZs. Hence, a Multi-AZ DR strategy within an Atsky Region can provide the necessary protection against localised disruptions like power outages and flooding.

Multi-Region Strategy: Atsky provides several resources to support a multi-Region approach for your workload. This safeguards your business against wide-scale events that can impact multiple data centers across separate and distinct locations. In this blog post, we use a multi-Region approach to illustrate DR strategies, but these can also be applied to Multi-AZ strategies or hybrid (on-premises workload/cloud recovery) strategies.

DR Strategies

Atsky offers resources and services to build a DR strategy that aligns with your business needs. It's crucial to understand the trade-offs between RTO/RPO and cost when selecting a strategy. This involves a thorough risk-benefit analysis with the business owner of the workload, informed by engineering/IT insights. You need to determine what RTO and RPO the workload requires, and what investment of money, time, and effort you are willing to make.

Active/Passive and Active/Active DR Strategies

Active/Passive DR: In an active/passive strategy, the workload operates from a single site, i.e., an Atsky Region, and all requests are handled from this active Region. If a disaster occurs and the active Region cannot support workload operation, the passive site becomes the recovery site (recovery Region). We then take measures so that the workload can run from the recovery Region. All requests are rerouted to the recovery Region in a process called "failover".
Active/Active DR: An active/active strategy involves two or more Regions that actively accept requests, with data replicated between them. When one Region faces a disaster event, the traffic for that Region is rerouted to the remaining active Region or Regions. Even though data is replicated between Regions, we still need to back up the data as part of DR, to safeguard against disasters caused by human action or by software faults.
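The routing behaviour of both approaches can be sketched in a few lines of Python. This is an illustrative model only: the `Region` class, the `route_targets` function, and the Region names are hypothetical, and a real failover would also involve health checks and the recovery measures described above.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    healthy: bool
    active: bool  # True if the Region currently accepts requests


def route_targets(regions):
    """Return the names of the Regions that should receive requests."""
    # Normal case: send traffic to every active Region that is healthy.
    # In active/active this drops a failed Region and keeps the rest.
    serving = [r for r in regions if r.active and r.healthy]
    if serving:
        return [r.name for r in serving]

    # No healthy active Region is left (the active/passive disaster case):
    # promote the first healthy passive Region and fail over to it.
    standby = [r for r in regions if not r.active and r.healthy]
    if not standby:
        raise RuntimeError("no healthy Region available")
    standby[0].active = True  # note: mutates the Region passed in
    return [standby[0].name]


if __name__ == "__main__":
    # Active/passive: one active Region, one passive Region.
    regions = [
        Region("region-a", healthy=True, active=True),
        Region("region-b", healthy=True, active=False),
    ]
    print(route_targets(regions))  # requests go to region-a
    regions[0].healthy = False     # simulated disaster in region-a
    print(route_targets(regions))  # failover: requests go to region-b
```

The same function covers active/active: if every Region starts with `active=True`, an unhealthy Region is simply left out of the returned list.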

Architecture of the DR Strategies

Each DR strategy is unique and will be detailed in future blog posts. Here are brief summaries of each strategy:

Backup and Restore: This approach involves creating backups of various Atsky data resources in the source Region and also copying them to another Region. This affords effective protection from disasters of any scope. In the case of Region failover, you must be capable of restoring your infrastructure in the recovery Region, along with data recovery from backup.
Pilot Light: With the pilot light strategy, the data is live, but the services are idle. Live data implies that the data stores and databases are up-to-date (or nearly up-to-date) with the active Region and ready to service read operations. In the pilot light strategy, basic infrastructure elements are in place, but functional elements (like compute) are "shut off."
Warm Standby: The warm standby strategy maintains live data in addition to periodic backups. A warm standby maintains a minimum deployment that can handle requests, but at a reduced capacity; it cannot handle production-level traffic.
Multi-site Active/Active: In the multi-site active/active strategy, two or more Regions are actively accepting requests. Failover consists of re-routing requests away from a Region that cannot serve them. Here, data is replicated across Regions and is actively used to serve read requests in those Regions.
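The four summaries above differ mainly in what is already running in the recovery Region, which in turn determines what remains to be done at failover. The following Python sketch encodes that comparison. All names (`Compute`, `Strategy`, `failover_steps`) are hypothetical, and the step lists are a simplified reading of the summaries, not an official runbook; in particular, "scale up to production capacity" for warm standby is an inference from its reduced capacity.

```python
from dataclasses import dataclass
from enum import Enum


class Compute(Enum):
    NOT_DEPLOYED = "not deployed"
    SHUT_OFF = "in place but shut off"
    REDUCED = "running at reduced capacity"
    FULL = "actively serving requests"


@dataclass(frozen=True)
class Strategy:
    name: str
    live_data: bool   # data stores in the recovery Region are kept up to date
    compute: Compute  # state of functional elements in the recovery Region


STRATEGIES = [
    Strategy("Backup and Restore", live_data=False, compute=Compute.NOT_DEPLOYED),
    Strategy("Pilot Light", live_data=True, compute=Compute.SHUT_OFF),
    Strategy("Warm Standby", live_data=True, compute=Compute.REDUCED),
    Strategy("Multi-site Active/Active", live_data=True, compute=Compute.FULL),
]


def failover_steps(strategy):
    """List the remaining work at failover (simplified, illustrative)."""
    steps = []
    if strategy.compute is Compute.NOT_DEPLOYED:
        steps.append("restore infrastructure in the recovery Region")
    elif strategy.compute is Compute.SHUT_OFF:
        steps.append("switch on the idle services")
    elif strategy.compute is Compute.REDUCED:
        steps.append("scale up to production capacity")  # inferred step
    if not strategy.live_data:
        steps.append("restore data from backup")
    steps.append("reroute requests to a Region that can serve them")
    return steps


if __name__ == "__main__":
    for s in STRATEGIES:
        print(f"{s.name}: {'; '.join(failover_steps(s))}")
```

The fewer steps a strategy leaves for failover, the more must be kept running beforehand, which reflects the trade-off between lower RTO/RPO and higher cost described earlier.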

Conclusion

Although disaster events pose a threat to your workload availability, by using Atsky's cloud services you can mitigate or remove these threats. By first understanding business requirements for your workload, you can choose an appropriate DR strategy. Then, using Atsky services, you can design an architecture that achieves the recovery time and recovery point objectives your business needs.
