Automating DR Resource Switchovers in Skytap
Disaster Recovery in the Cloud
Business continuity is one of the most important topics when moving critical systems, and even more so when moving them into a public cloud environment. This is due to a few factors:
- Inability to influence the SLA levels on offer, which may be insufficient for your requirements
- Inability to fully secure a platform that is partially under someone else’s control
- Inability to validate the entire range of infrastructural implementations that make up the service
So, whenever we start a cloud migration project, we almost immediately begin discussing Disaster Recovery. It directly impacts the landscape design as well as the design of the automation used to deploy resources and to increase them on request.
In a great number of cases, AIX customers use storage-based replication to protect their primary production systems. The approach is often preferred because it is easy to implement and support, even though it can be harder to test and can result in a longer RTO.
How do we address this in Skytap on Azure?
Disaster Recovery in Skytap
Skytap on Azure has no DIRECT implementation of storage-based replication. Therefore, we suggest three main approaches to DR in Skytap:
- Second copy of backup
Azure Blob storage is used as backup storage media in Azure Native and is widely accepted for AIX and IBM i when such implementations are made in Skytap on Azure as well. Having a second copy of the backup in Azure Native is one of the basic ways to provide Disaster Recovery.
- Template copies
Templates are storage snapshots taken of entire LPARs, including any MAS (multi-attached storage) disks attached to them. These templates in Skytap can be copied to another region.
- Active instances
By far the most popular implementation we do is to have active instances in the secondary region, with some form of replication between production and DR. This is exactly what is discussed below.
Resourcing of DR
Active instances are an excellent way to meet very aggressive RPO and RTO targets for the most critical applications. They imply that you have servers already prepared for a Disaster Recovery event. This makes it easy to prepare networking in advance, including different addresses and different site-to-site tunnels or ExpressRoute to Azure. With routing prepared beforehand, testing is also simpler and does not affect Production.
However, this can increase costs. Active instances in Skytap cost money for RAM, storage and public IPs. Often they also trigger additional license costs for Oracle, DB2, WebSphere or other software.
Thus, we frequently implement Disaster Recovery sites with significantly reduced resourcing, for example 25% or even 15% of the RAM that the servers use in production. This keeps DR costs under control while retaining the advantages of a very robust DR solution.
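To make the sizing concrete, here is a minimal sketch of the arithmetic. The server names, production sizes and the 4 GB floor are hypothetical values chosen for illustration; only the 25% ratio comes from the text above.

```python
def dr_ram_gb(prod_ram_gb: float, fraction: float, minimum_gb: float = 4.0) -> float:
    """Return the standby RAM for a DR server as a fraction of production RAM."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Never go below a floor the OS needs to boot (the floor is an assumption).
    return max(minimum_gb, prod_ram_gb * fraction)


# Hypothetical production sizes in GB, for illustration only.
production = {"db01": 256, "app01": 64, "app02": 64}

# RAM paid for while DR is idle, at 25% of production.
standby = {name: dr_ram_gb(ram, 0.25) for name, ram in production.items()}

# RAM that the switchover automation has to add per server.
to_add_on_switchover = {name: production[name] - standby[name] for name in production}

print(standby)
print(to_add_on_switchover)
```

The second dictionary is the interesting one for planning: it is the amount of capacity that must be available in the DR region at the moment of switchover.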
What could be the downside of this?
Clearly, the need to increase resources rapidly when you have to switch over to DR.
How do we achieve this in Skytap?
Automation of Re-configurations
In Skytap, resource re-configurations require stopping the servers involved. You cannot increase the CPU and RAM of a server in Skytap without first making sure that it is powered off. That is of course a very inconvenient limitation, especially if you need to move quickly in a DR event. (We believe this limitation will be addressed in the future.)
Consequently, we need to automate this process. Only automation, and regular testing of that automation, will ensure that the switchover to DR and the increase of resources are done quickly enough to meet the requested RTO.
That usually involves an Azure DevOps implementation with a number of pipelines pre-configured, tested and ready to be launched once the decision to switch to DR is made.
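The order of operations such a pipeline step has to respect can be sketched as follows. This is an illustration only: `FakeLpar` is a hypothetical in-memory stand-in, not a Skytap client, and a real pipeline would replace its methods with calls to the Skytap API.

```python
from dataclasses import dataclass, field


@dataclass
class FakeLpar:
    """In-memory stand-in for an LPAR (hypothetical, for illustration)."""
    name: str
    ram_gb: int
    powered_on: bool = True
    log: list = field(default_factory=list)

    def stop(self):
        self.powered_on = False
        self.log.append("stop")

    def set_ram(self, ram_gb):
        # Mirrors the limitation described above: no resize while running.
        if self.powered_on:
            raise RuntimeError("cannot resize a running LPAR")
        self.ram_gb = ram_gb
        self.log.append(f"set_ram={ram_gb}")

    def start(self):
        self.powered_on = True
        self.log.append("start")


def scale_up_for_dr(lpar, target_ram_gb):
    """Stop, resize, start, in that order."""
    lpar.stop()
    lpar.set_ram(target_ram_gb)
    lpar.start()


db = FakeLpar("db01", ram_gb=64)
scale_up_for_dr(db, target_ram_gb=256)
print(db.log)
```

Keeping the three steps in one function makes it harder for a pipeline to skip the power-off, which is the step that is easy to forget under pressure.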
Skytap Automation Specifics
We have discussed in previous blogs that Skytap has a few specifics that make it a bit less straightforward to automate.
First, you always have to take into account the way environments work. Performing an operation on VMs or subnets in an environment renders the environment busy, which stops you from performing any other action on it at the same time. In practice, you cannot increase resources on several servers in the same environment simultaneously: the second operation will fail if the first one has not completed.
One way to get around this is to stop the entire environment – all VMs/LPARs at the same time and then work with resources. However, in some cases, that might not work for you as you might need to follow a certain order of stopping and starting the servers.
Another approach is to split these servers across different environments in DR, keeping in mind the order in which you will need to stop them on switchover. For instance, one environment with DBs and one with applications: you would usually stop the applications before the DBs, and separate environments let you do so.
Bearing that in mind, always check the status of the VM and the environment in Skytap in your automation, so that you can work around the busy status.
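A status check of this kind is typically a small polling loop. The sketch below assumes the status is available as a string and that the transitional value is `busy`; the `get_runstate` callable is a hypothetical hook that your automation would implement with a real status request. Timeout and poll interval are illustrative defaults.

```python
import time
from typing import Callable


def wait_until_not_busy(
    get_runstate: Callable[[], str],
    timeout_s: float = 600.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the reported state is no longer 'busy', then return it."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_runstate()
        if state != "busy":
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still busy after {timeout_s:.0f} s")
        sleep(poll_s)


# Simulated status sequence standing in for real API responses.
_states = iter(["busy", "busy", "stopped"])
final_state = wait_until_not_busy(lambda: next(_states), sleep=lambda s: None)
print(final_state)
```

Calling this before every operation on an environment, and again after it, keeps a second request from being rejected while the first is still in flight.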
Another specific of Skytap is that the RAM of Power LPARs is directly correlated with entitled capacity and virtual processors. So, make sure your automation applies only validated values for these resources.
All of the above means that even with automation, switching to DR with resource optimization can take some time.
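One way to keep automation on validated values is to let it choose only from a fixed table of combinations that have been tested beforehand. The profile names and numbers below are hypothetical placeholders, not Skytap limits; replace them with the combinations you have actually validated.

```python
from typing import NamedTuple


class LparProfile(NamedTuple):
    ram_gb: int
    entitled_capacity: float
    virtual_processors: int


# Hypothetical, pre-validated combinations (placeholders for illustration).
VALIDATED_PROFILES = {
    "dr-standby": LparProfile(ram_gb=32, entitled_capacity=0.5, virtual_processors=1),
    "dr-active": LparProfile(ram_gb=128, entitled_capacity=2.0, virtual_processors=4),
}


def profile_for(name: str) -> LparProfile:
    """Return a validated profile, refusing anything that is not in the table."""
    try:
        return VALIDATED_PROFILES[name]
    except KeyError:
        raise ValueError(f"no validated profile named {name!r}") from None


target = profile_for("dr-active")
print(target)
```

Because the pipeline never computes these three values independently, it cannot request a combination that was never tested.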
But how much time?
RTO impact
Automating the DR switchover together with the resource increase will affect the time needed to start services in the secondary site, that is, your RTO. Can you determine what that effect will be?
That is a hard question to answer without knowing the specifics of the implementation. Several factors directly slow down the stopping of an AIX LPAR, such as:
- Lack of DNS connectivity
- Great number of services that need to be stopped
- Slow disk operations that need to complete
Additionally, the number of environments, the servers in them and their stopping order all influence the effect on RTO. For instance:
- Are you waiting for environment A to stop before stopping environment B?
- Do you add a buffer after stopping the servers to make sure that they are really powered off?
- Do you need to stop something else – like some test or dev environments in the region in order to free resources for the DR servers?
- Do you need to stop services on the LPARs before actually stopping them?
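The questions above can be turned into a rough estimate of the stop phase. The sketch below is a simplification under stated assumptions: stop durations per environment are known from testing, environments without a dependency on each other stop in parallel, the same buffer is added after each stop, and the dependencies contain no cycles. All names and numbers are hypothetical.

```python
def stop_phase_minutes(env_stop_minutes, depends_on, buffer_minutes=0.0):
    """Minutes until every environment is stopped.

    An environment starts stopping only after all environments listed for it
    in depends_on have finished. Dependencies must not form a cycle.
    """
    finish = {}

    def finish_time(env):
        if env not in finish:
            start = max((finish_time(d) for d in depends_on.get(env, [])), default=0.0)
            finish[env] = start + env_stop_minutes[env] + buffer_minutes
        return finish[env]

    return max(finish_time(env) for env in env_stop_minutes)


# Hypothetical measured stop times in minutes.
durations = {"apps": 6.0, "db": 10.0, "batch": 4.0}
# The DB environment waits for the application environment to stop first.
waits_for = {"db": ["apps"]}

parallel = stop_phase_minutes(durations, waits_for, buffer_minutes=2.0)
serial = sum(d + 2.0 for d in durations.values())
print(parallel, serial)
```

With these sample numbers the dependency chain takes 20 minutes, against 26 minutes if all three environments were stopped one after another. The estimate covers only the stop phase; resizing and starting add to it.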
Parallelism of Switchovers
All of these questions determine how you parallelize the switchover, which is a very important way to save time. As mentioned above, your DR design has to take this into consideration, driven mainly by the order in which you need to stop servers.
Make sure that servers are logically separated into more environments (the more the better), as this provides a lot of flexibility in terms of resources, networking and a number of other things.
Once you have spread the servers across suitable environments, it is easy to plan the power-off activities so that they do not affect each other. With that type of parallelism, you are not limited by the Skytap environment busy states.
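As an illustration of this idea, the sketch below runs one worker per environment and handles the servers inside each environment one at a time. The environment layout is hypothetical, and `resize_vm` is a placeholder for the real stop, resize and start sequence; ordering constraints between environments are not modelled here and would need to be added on top.

```python
from concurrent.futures import ThreadPoolExecutor


def switch_environment(env_name, vm_names, resize_vm):
    """Handle the VMs of one environment serially, since it is busy per operation."""
    return [resize_vm(env_name, vm) for vm in vm_names]


def switch_all(environments, resize_vm, max_workers=4):
    """Run one worker per environment so that busy states do not block each other."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            env: pool.submit(switch_environment, env, vms, resize_vm)
            for env, vms in environments.items()
        }
        # result() re-raises any exception from the worker, so failures surface.
        return {env: fut.result() for env, fut in futures.items()}


# Hypothetical layout and a stand-in for the real per-VM operation.
environments = {"dr-apps": ["app01", "app02"], "dr-db": ["db01"]}
results = switch_all(environments, lambda env, vm: f"{env}/{vm} resized")
print(results)
```

Threads are sufficient here because the work is waiting on remote operations rather than computing locally.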
Combining with Application Failovers
Finally, you have to consider more than just Skytap APIs. Dependencies on application stops and starts will also affect the RTO.
For instance, if you have an Oracle DB, it is not enough to increase the RAM of the LPARs; you also have to re-configure the DB parameters that determine how much memory the DB uses so that they match Production.
There are more dependencies of this type, and they depend on exactly WHAT you are switching over.
L3C has extensive experience with a wide range of switchovers, so contact us to talk it through. We can:
- Design your landscape in prod and DR
- Analyze your workloads
- Optimize your resourcing
- Automate your switchovers