The side effects of failover in a cluster

August 19, 2009

by Yaniv Valik
SR DR Specialist, DR Assurance Group

It happens all the time. You’ve decided to manually switch to a different node in the cluster, or maybe your active node crashed. Luckily enough, production services started running on the formerly passive node (well, sometimes they don’t…). Everything is up and running, but something has changed, and not for the better… usually it’s performance.

If you’re a database/system/storage administrator or an IT manager, you’re probably all too familiar with this scenario. It doesn’t matter whether you’re using Veritas Cluster, Microsoft Cluster, AIX HACMP, HP-UX MC/ServiceGuard or Sun/Linux Clusters. Finding the root cause, if you find it at all, could take weeks on end. Database, server and storage configuration (and everything in between) is so complex in today’s datacenters that there could be thousands of potential causes. Even when you have a “suspect”, testing it may result in additional side effects or, worse, downtime.

Here are a few examples of things that could go wrong and degrade performance… but they’re really just the tip of the iceberg:

Example I: Reduced I/O Settings

  • The standby/passive node has fewer I/O paths to SAN volumes, so it can carry less I/O load. There are dozens of possible variations on this theme…
  • The standby has the same number of paths, but they are not distributed across Fibre Channel adapters and ports as evenly as on the active node
  • I/O mode differences – round-robin or another multi-path load balancing algorithm is configured on the active node, while the standby is configured for path fail-over only (no load balancing)
  • Different I/O queue depth configured per device and/or per HBA
  • And so on
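
A check for this kind of gap can be sketched in a few lines of Python. This is only an illustration: the `IoSettings` record, the `compare_io_settings` function and the sample values are all hypothetical, and collecting the real numbers from each node (multipath tools, HBA utilities and so on) is platform-specific and not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IoSettings:
    """Hypothetical per-volume I/O settings gathered from one node."""
    path_count: int   # number of I/O paths to the SAN volume
    policy: str       # e.g. "round-robin" or "failover"
    queue_depth: int  # I/O queue depth configured for the device


def compare_io_settings(active, passive):
    """Return human-readable gaps between two {volume: IoSettings} maps."""
    gaps = []
    for volume in sorted(active.keys() | passive.keys()):
        a, p = active.get(volume), passive.get(volume)
        if a is None or p is None:
            missing_on = "active" if a is None else "passive"
            gaps.append(f"{volume}: not visible on {missing_on} node")
            continue
        if p.path_count < a.path_count:
            gaps.append(
                f"{volume}: {p.path_count} paths on passive vs "
                f"{a.path_count} on active"
            )
        if p.policy != a.policy:
            gaps.append(
                f"{volume}: policy '{p.policy}' on passive vs "
                f"'{a.policy}' on active"
            )
        if p.queue_depth != a.queue_depth:
            gaps.append(
                f"{volume}: queue depth {p.queue_depth} on passive vs "
                f"{a.queue_depth} on active"
            )
    return gaps


# Made-up sample data, for illustration only.
active = {
    "vol01": IoSettings(path_count=4, policy="round-robin", queue_depth=32),
    "vol02": IoSettings(path_count=4, policy="round-robin", queue_depth=32),
}
passive = {
    "vol01": IoSettings(path_count=2, policy="failover", queue_depth=32),
}

for gap in compare_io_settings(active, passive):
    print(gap)
```

The sketch only reports differences; deciding which ones actually matter for a given storage array remains a judgment call.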

Example II: Different Server Configuration

In this category the list of examples is practically endless… here are a few:

  • The passive node is configured with different performance settings (for example, on Microsoft Windows, processor scheduling is adjusted for best performance of programs on the passive node, but for background services on the active node)
  • The passive node does not have the latest system or application patch, service pack or version installed
  • The passive node does not use network interface load balancing while the active node does
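
Patch and version drift in particular lends itself to a simple comparison. The sketch below assumes you already have, for each node, a mapping from component name to version string; the `compare_versions` function, the component names and the version numbers are all invented for illustration. It deliberately reports only that versions differ rather than which one is newer, since version ordering rules vary from vendor to vendor.

```python
def compare_versions(active, passive):
    """Compare two {component: version} maps and group the differences."""
    report = {
        "missing_on_passive": [],
        "missing_on_active": [],
        "version_mismatch": [],
    }
    for name in sorted(active.keys() | passive.keys()):
        if name not in passive:
            report["missing_on_passive"].append(name)
        elif name not in active:
            report["missing_on_active"].append(name)
        elif active[name] != passive[name]:
            # Record (component, active version, passive version).
            report["version_mismatch"].append(
                (name, active[name], passive[name])
            )
    return report


# Made-up inventories, for illustration only.
active_node = {"os-kernel": "2.6.18-128", "db-server": "10.2.0.4", "hba-driver": "8.02"}
passive_node = {"os-kernel": "2.6.18-92", "db-server": "10.2.0.4"}

report = compare_versions(active_node, passive_node)
for category, items in report.items():
    print(category, items)
```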

Example III: Uneven Database-related Configuration

  • The standby/passive node is configured with reduced values for critical system parameters affecting database performance – such as shared memory parameters, semaphores, file limits, and so on
  • The standby/passive node has a different performance-related database configuration (e.g., maximum number of processes / threads / sessions, memory pool sizes, operation mode, transaction logging settings, software logging settings, and so on)
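
For the system-parameter case, here is a minimal sketch that assumes Linux nodes where the output of `sysctl -a` has been saved as text in the usual `key = value` form. The function names and the sample values are hypothetical; other platforms expose these parameters differently and would need their own parser.

```python
def parse_sysctl(text):
    """Parse `key = value` lines into {key: tuple of ints}.

    Values that are not plain non-negative integers (hostnames,
    negative numbers, and so on) are skipped.
    """
    params = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            continue
        fields = value.split()
        if fields and all(f.isdigit() for f in fields):
            params[key.strip()] = tuple(int(f) for f in fields)
    return params


def reduced_on_passive(active_text, passive_text):
    """List parameters where any field is lower on the passive node."""
    active = parse_sysctl(active_text)
    passive = parse_sysctl(passive_text)
    findings = []
    for key in sorted(active.keys() & passive.keys()):
        a, p = active[key], passive[key]
        if len(a) == len(p) and any(pv < av for av, pv in zip(a, p)):
            findings.append((key, a, p))
    return findings


# Made-up sample output, for illustration only.
ACTIVE = """kernel.shmmax = 68719476736
kernel.sem = 250 32000 100 128
fs.file-max = 6815744
kernel.hostname = node-a
"""
PASSIVE = """kernel.shmmax = 4294967296
kernel.sem = 250 32000 32 128
fs.file-max = 6815744
kernel.hostname = node-b
"""

for key, a, p in reduced_on_passive(ACTIVE, PASSIVE):
    print(f"{key}: active={a} passive={p}")
```

A lower value is not always a problem, so treat the output as a list of things to review rather than a list of errors.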

Can this be avoided?
Do not wait for the next failover event. Why have end users and application teams breathing down your neck? Verify on an ongoing basis that your clusters follow the vendor’s best practices and that all nodes are aligned in terms of software, kernel parameters, operating system settings, limits, configuration files, hardware-related configuration, and so on.

Automation is required. Automated monitoring that identifies gaps between the active and passive nodes of a cluster is the only practical solution.

A failover is more than just getting everything up and running as fast as possible. If service levels drop after the failover, operations still suffer and money is lost. RecoverGuard by Continuity Software addresses these challenges by intelligently identifying risks and vulnerabilities that may result in downtime or reduced performance in case of failover.

P.S.
Readers, it would be great if you could share your experience with failover-related troubles with me.


Great case study by IDC: Bank of Israel Addresses HA/DR Challenges

August 14, 2009

By Yaniv Valik
SR DR Specialist, DR Assurance Group

The recent global outage at PayPal was a high-profile reminder of why it is important to protect your business against unexpected downtime. I thought you’d be interested in learning what the Bank of Israel is doing to avoid a similar fate. Dan Yachin, an analyst with IDC, has just written a great case study that explores how Israel’s central bank overcame the limitations of traditional disaster recovery and high availability testing to ensure its DR readiness and availability. I really encourage you to take a look at it. You can download the case study here: http://www.continuitysoftware.com/IDC-BankOfIsrael


Note about New Software Release

August 3, 2009

By Gil Hecht
CEO

I thought I’d use this space to give you a quick update on the latest version of our RecoverGuard automated disaster recovery/high availability testing and monitoring software. As you know, RecoverGuard automatically scans your IT infrastructure to find hidden configuration errors or data protection gaps before they impact your operations.  Here’s a brief summary of the new functionality in Version 4.3:

  • Support for EMC CLARiiON platform, including MirrorView and SnapView. RecoverGuard also supports EMC Symmetrix, NetApp and HDS USP & AMS.  In an upcoming release we’ll expand this list further by adding support for HP XP and IBM DS.
  • Configuration testing support for all major cluster vendors, enabling you to ensure availability at all times
  • Significant advancements to the Availability Advisor, enabling RecoverGuard to scan for risks in kernel parameters, storage routing, domain and DNS settings, patches and service packs, and much more.

Of course, we continue to add new DR/HA risks to RecoverGuard’s robust gap detection knowledgebase, which currently holds over 3,000 potential known risks. We’ve posted some of the more common gaps on this blog, and will continue to do so as time goes on. You can see all of those we’ve posted by clicking on “Gap Analysis” under “Categories” in the upper right portion of this blog. If you’d like to see even more examples, you can see them on our website at http://www.continuitysoftware.com/commongaps

