IBM AIX: High Availability

November 3, 2020 by Ivaylo Nikolov in AIX, Backup, Disaster Recovery, High Availability

Introduction

In this series, we share our view on a few business continuity topics, mainly backup, high availability (HA), and disaster recovery (DR), as they apply to Power-based AIX environments. The series will include a few articles focused on these topics, describing some of the typical problems and resolutions that we have faced over the years. We will also look more closely at which parts customers most commonly miss, how to prepare for a business continuity improvement, and what results can be expected and achieved.

Different approaches to Business Continuity

If you are not familiar with the term, business continuity encompasses all measures, tools, and techniques used to handle events that could disrupt the business, whether they are related to IT or not. It can be improved by anything from the simplest data copy solution, used as replication or backup, to a full-blown active-active Tier 7 disaster recovery solution. If you are working towards letting the business run uninterrupted, you are already part of the business continuity improvement process.

Backup, HA and DR

The most common ways to improve business continuity are providing redundancy for services within the same data center (part of which falls under HA), keeping a secure second copy of the data with the retention the business requests (backup), and ensuring that the business continues to work even after a catastrophic failure and the loss of an entire data center, or an equivalent event (disaster recovery).

A surprising observation is that high-availability and disaster recovery solutions are often confused by the very people who are meant to implement and support them. This is likely because very similar mechanisms are often used to implement both HA and DR.

It therefore makes sense to describe some of the differences between HA and DR, at least according to our own experience:

The main difference between the two approaches is that HA tackles failure and recovery within the same data center, while DR tackles failure and recovery between two different data centers.
High availability usually uses fully automated tools to switch from the failed resources to the redundant ones, while DR in its most common form is triggered manually.
With high availability, the resources used by the systems remain unchanged in most cases: network configurations, data storage, monitoring, and security are switched to the redundant platform, and the process remains transparent to end users. Disaster recovery in most cases depends on decisions taken by responsible personnel and takes more time, so it is usually not transparent to the business.
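
The contrast between automatic HA failover and manually triggered DR can be sketched in a few lines of Python. This is only an illustrative model under our own assumptions: the `Node` class, the `ha_failover` and `dr_failover` functions, and the node and site names are hypothetical and do not correspond to any PowerHA or VM Recovery Manager interface.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    site: str
    healthy: bool = True


def ha_failover(active, standby):
    """HA: switch automatically to a standby in the same data center."""
    if active.healthy:
        return active  # nothing failed, keep running where we are
    if standby.healthy and standby.site == active.site:
        return standby  # automatic switch, no human decision involved
    raise RuntimeError("no healthy standby in the same site")


def dr_failover(active, dr_node, approved_by=None):
    """DR: switch to another data center only after someone approves it."""
    if approved_by is None:
        raise PermissionError("DR failover needs a decision by responsible personnel")
    if not dr_node.healthy or dr_node.site == active.site:
        raise RuntimeError("DR target must be a healthy node in a different site")
    return dr_node


# Hypothetical nodes: production has failed in data center dc1.
prod = Node("aix-prod-a", site="dc1", healthy=False)
standby = Node("aix-prod-b", site="dc1")
remote = Node("aix-dr", site="dc2")

print(ha_failover(prod, standby).name)                        # aix-prod-b
print(dr_failover(prod, remote, approved_by="on-call").name)  # aix-dr
```

Calling `dr_failover` without `approved_by` raises an error in this sketch, which mirrors the point that DR is normally a decision rather than an automatic reaction.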

HA – more detail

Bearing in mind the described differences, we must point out that there are different approaches to high availability as well.

These differ mainly in the level at which high availability is implemented: in the application, at the operating system level, or at the hardware (hypervisor) level.

Here is a small comparison:

| HA type | Hypervisor/HW | OS | Application |
| --- | --- | --- | --- |
| Description | Hypervisor management software fails the virtual machine over to another node that it also manages | OS clusterware switches between nodes, usually from active to passive | The application switches between its own application nodes |
| Failover | Active/inactive | Mostly active/passive | Mostly active/active |
| Setup | Spare resources needed on another hardware node. No preparation needed at OS or application level | Install and configure the OS clusterware; some configuration or scripting may be needed at application level | Install and configure the application clusterware, with all the requirements it adds to the environment |
| Level | Fully transparent for application and OS | Fully transparent for application | Hypervisor and OS agnostic. Nodes can be on different platforms |
| Implication | OS and application restarted | Application restarted | Almost no impact |
| Skills | Hypervisor/equipment | OS and OS clusterware; some application knowledge may be needed | Application and application clusterware |
| Costs | Limited. Only the hypervisor is needed, and the feature is usually standard | Medium. Needs at least a second OS license and a clusterware license on both nodes (if there are two) | High. Needs a second application license for the availability node as well as for the clusterware |
| Examples | IBM VM Recovery Manager, VMware HA, etc. | IBM PowerHA, RHEL HA, SUSE HA, etc. | Oracle RAC, DB2 HA, Progress OpenEdge Replication |

HA for AIX and Power

To get to our main point, we need to clarify how AIX on Power handles high availability. For many years, IBM's main focus for the Power business was the most critical applications at customer sites. Implementations on AIX were typically complex Oracle, Progress, DB2 or SAP environments, where business continuity is an absolute must. AIX has therefore been focused on stability and security, and on making its OS clusterware (IBM HACMP / PowerHA) as stable as possible.

As a result, effort went into improving the OS and PowerHA, but nothing was really done to compete with the increasingly popular VMware vSphere HA solution.

In the last decade, however, IBM has created and now supports a stable implementation at pure hypervisor level, which restarts and relocates LPARs from a failed active node to an inactive second node. VM Recovery Manager HA uses the HMC and PowerVM (VIOS), together with a GUI, to provide vSphere HA-type functionality: you do not need to care about what runs on your LPAR, and you can rely on the LPAR being restarted automatically on another healthy Power server.
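
To make the hypervisor-level idea concrete, here is a minimal Python sketch of planning where to restart the LPARs of a failed server. It is not how VM Recovery Manager HA is implemented; the function name, the host and LPAR names, and the "most spare capacity first" placement rule are all assumptions chosen for illustration. It does show why spare resources on another hardware node are a prerequisite.

```python
def restart_on_healthy_hosts(placement, demand, spare, failed_host):
    """Plan where to restart the LPARs of a failed host.

    placement: LPAR name -> host it currently runs on
    demand:    LPAR name -> processor units it needs
    spare:     host name -> spare processor units available
    Returns (new placement, LPARs that could not be restarted).
    """
    new_placement = dict(placement)
    # Only healthy hosts can receive LPARs.
    remaining = {h: c for h, c in spare.items() if h != failed_host}
    failed_lpars = [l for l, h in placement.items() if h == failed_host]
    unplaced = []
    # Place the largest LPARs first so they are not squeezed out.
    for lpar in sorted(failed_lpars, key=lambda l: demand[l], reverse=True):
        # Assumed rule: pick the healthy host with the most spare capacity.
        target = max(remaining, key=remaining.get, default=None)
        if target is None or remaining[target] < demand[lpar]:
            unplaced.append(lpar)  # stays down until capacity is freed
            continue
        remaining[target] -= demand[lpar]
        new_placement[lpar] = target
    return new_placement, unplaced


# Hypothetical environment: power-a fails, two healthy servers remain.
placement = {"db01": "power-a", "app01": "power-a", "web01": "power-b"}
demand = {"db01": 4, "app01": 2, "web01": 1}
spare = {"power-a": 0, "power-b": 5, "power-c": 3}

plan, unplaced = restart_on_healthy_hosts(placement, demand, spare, "power-a")
print(plan)      # {'db01': 'power-b', 'app01': 'power-c', 'web01': 'power-b'}
print(unplaced)  # []
```

Nothing in the plan depends on what runs inside each LPAR, which is the point of doing HA at this level: the operating system and application are simply restarted elsewhere.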

Having said that, we can still confirm that PowerHA is the predominant solution. This is mainly because the same AIX experts almost always have the expertise to run and support the cluster environments as well. Moreover, PowerHA handles IP pools and resource groups within the nodes, so the implementation is, and feels, more consistent from an application point of view. PowerHA is also easily scripted to integrate with the application, monitoring and backup, which is not to be underestimated.

Application clustering on AIX is used in a lot of implementations as well; however, the high costs are usually the deciding factor that leads customers to avoid it.

Here at L3C, we usually provide a mix of hypervisor HA and OS clusterware for customers to ensure the best possible continuity.

You can learn more about our AIX Cloud Service.
