IBM AIX: High Availability
In this series, we share our view on a few business continuity topics: mainly backup, high availability (HA), and disaster recovery (DR) as they apply to Power-based AIX environments. The series consists of several articles that focus on these topics and describe some of the typical problems and resolutions we have faced over the years. We will also look more closely at the parts customers most commonly miss, how to prepare for a business continuity improvement, and what results can be expected and achieved.
Different approaches to Business Continuity
If you are not familiar with the term, business continuity encompasses all measures, tools, and techniques used to handle events that could disrupt the business, whether they are related to IT or not. Business continuity can be improved by anything from the simplest data copy solution, used as replication or backup, to a full-blown active-active Tier 7 disaster recovery solution. If you are working towards letting the business run uninterrupted, you are already part of the business continuity improvement process.
Backup, HA and DR
The most common ways to improve business continuity are providing redundancy for services within the same data center (which is partly what HA covers), providing a secure second copy of the data (with retention as requested by the business), and ensuring that the business continues to work even after a catastrophic failure and the loss of an entire data center (or an equivalent event), which is disaster recovery.
A surprising observation is that high-availability and disaster recovery solutions are often confused by the very people who are meant to implement and support them. This is likely because very similar mechanisms are often used to implement both HA and DR.
It therefore makes sense to describe some of the differences between HA and DR, at least according to our own experience:
- The main difference between the two approaches is that HA handles failure and recovery within the same data center, while DR handles failure and recovery between two different data centers.
- High availability usually uses fully automated tools to switch from the failed resources to the redundant ones, while DR in its most common form is triggered manually.
- With high availability, the resources used by the systems remain unchanged in most cases: network configuration, data storage, monitoring, and security are switched to the redundant platform, and the process remains transparent to end users. Disaster recovery in most cases requires decisions by the responsible personnel and takes more time, so it is usually not transparent to the business.
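The contrast between automated HA failover and manually triggered DR can be illustrated with a minimal Python sketch. All names here (`Site`, `ha_failover`, `dr_failover`, the `approved` flag) are hypothetical and exist only for illustration; they do not correspond to any product API.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """A node or data center that can host the service (hypothetical model)."""
    name: str
    healthy: bool = True


def ha_failover(active: Site, standby: Site) -> Site:
    """HA: switch automatically to the standby within the same data center."""
    if active.healthy:
        return active
    if standby.healthy:
        return standby  # no human decision involved
    raise RuntimeError("no healthy node left in this data center")


def dr_failover(primary: Site, recovery: Site, approved: bool) -> Site:
    """DR: move to the second data center only after a human decision."""
    if primary.healthy or not approved:
        # Without approval the service stays on the primary site,
        # even if that means it is currently down.
        return primary
    if recovery.healthy:
        return recovery
    raise RuntimeError("recovery site is not available")
```

The only structural difference is the `approved` argument: in the HA case the switch depends solely on health, while in the DR case someone responsible has to decide first.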
HA – more detail
Bearing in mind the described differences, we must point out that there are different approaches to high availability as well.
These differ mainly in the level at which high availability is implemented: in the application, at the operating system level, or in the hardware (hypervisor) itself.
Here is a small comparison:
| | Hypervisor level | OS level | Application level |
|---|---|---|---|
| Description | Hypervisor mgmt software does virtual machine failover to another node, which it also manages | OS clusterware takes care of switching from usually active to passive nodes | Application takes care of switching between different application nodes |
| Failover | Active/inactive | Active/passive mostly | Active/active mostly |
| Setup | Spare resources needed on another HW node. No preparation needed on OS and application level | Installing and configuring OS clusterware, some configuration / scripting may be needed on application level | Installing and configuring the application clusterware with all added requirements to the environment |
| Level | Fully transparent for application and OS | Fully transparent for application | Hypervisor and OS agnostic. Nodes could be on different platforms |
| Implication | OS and application restarted | Application restarted | Almost no impact |
| Skills | Hypervisor/equipment | OS and OS clusterware, might need some application knowledge | Application and application clusterware skills |
| Costs | Limited costs needed only for the hypervisor. Usually a standard feature | Medium level of costs. Needs at least second license for OS and license for the clusterware on both nodes (if they are two) | High level of costs. Needs second license for application for the availability node as well as for the clusterware |
| Examples | IBM VM Recovery Manager, VMware HA, etc. | IBM PowerHA, RHEL HA, SUSE HA, etc. | Oracle RAC, DB2 HA, Progress OpenEdge Replication |
HA for AIX and Power
To get to our main point, we need to clarify how AIX on Power handles high availability. For many years, the main focus of IBM's Power business was the most critical applications at customer sites. Implementations on AIX involved complex Oracle, Progress, DB2, or SAP environments, where business continuity is an absolute must. AIX has therefore focused on stability and security, as well as on OS clusterware (IBM HACMP / PowerHA) that works as reliably as possible.
Effort was thus invested in improving the OS and PowerHA, but little was done to compete with the increasingly popular VMware vSphere HA solution.
In the last decade, however, IBM has created and supported a very stable implementation at the pure hypervisor level, which reboots and relocates LPARs from the failed active node to an inactive second node. VM Recovery Manager HA uses the HMC and PowerVM (VIOs), combined with a GUI, to provide vSphere HA-type functionality: you do not need to care about what runs on your LPAR, and you can rely on an automatic restart of the LPAR on another healthy Power server.
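To make the idea of restarting an LPAR on another healthy server more concrete, here is a minimal Python sketch of a placement decision. It is an illustration only and not how VM Recovery Manager HA actually selects a target: the `Host` and `Lpar` classes, their fields, and the "most free cores wins" rule are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Host:
    """A Power server as seen by the management layer (hypothetical model)."""
    name: str
    healthy: bool
    free_cores: float
    free_memory_gb: int


@dataclass
class Lpar:
    """The partition that has to be restarted elsewhere (hypothetical model)."""
    name: str
    cores: float
    memory_gb: int


def pick_restart_host(lpar: Lpar, hosts: List[Host]) -> Optional[Host]:
    """Return a healthy host with enough spare capacity, or None if there is none."""
    candidates = [
        h for h in hosts
        if h.healthy
        and h.free_cores >= lpar.cores
        and h.free_memory_gb >= lpar.memory_gb
    ]
    if not candidates:
        return None
    # Assumed tie-break rule: prefer the host with the most free cores.
    return max(candidates, key=lambda h: h.free_cores)
```

The sketch also shows why spare resources on another hardware node are a prerequisite for this approach: if no healthy host has enough free capacity, there is nowhere to restart the LPAR.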
Having said that, we can still confirm that PowerHA is the predominant solution. This is mainly because the same AIX experts almost always have the expertise to run and support the cluster environments as well. Moreover, PowerHA handles IP pools and resource groups within the nodes, so the implementation is, and feels, more consistent from an application point of view. PowerHA is also easily scripted to integrate with applications, monitoring, and backup, which is not to be underestimated.
In many implementations, application-level clustering on AIX is used as well; however, the high costs usually lead customers to avoid it.
Here at L3C, we usually provide a mix of hypervisor HA and OS clusterware for customers to ensure the best possible continuity.
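As an example of the kind of scripting meant here, the following Python sketch shows a simple application health check that a cluster could call periodically. It assumes the common convention that exit code 0 means "healthy" and any other value means "failed"; the function names, the address, and the port 1521 are hypothetical and would need to be adapted to the real application and to what your clusterware expects.

```python
import socket


def app_is_listening(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the application port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable: treat all of these as "not healthy".
        return False


def monitor_exit_code(host: str, port: int) -> int:
    """Map the health check to a process exit code (0 = healthy, 1 = failed)."""
    return 0 if app_is_listening(host, port) else 1


# As a standalone monitor script, one would end with something like:
#   import sys
#   sys.exit(monitor_exit_code("127.0.0.1", 1521))  # hypothetical address and port
```

A port check is deliberately the simplest possible test; in practice a monitor often runs an application-level query as well, since a process can hold a port open while being unable to serve requests.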