Another hidden downtime risk that can come back to bite you

January 17, 2012

Today’s Topic: Cluster Shared SAN Configuration Drift

The most common way to share data between cluster nodes is through the use of multi-homed SAN storage. Inconsistent access to the SAN volumes by cluster nodes is a state in which one or more shared volumes are not mapped to one or more nodes.

 

Sharing is intended to guarantee immediate data availability in case of a failover, but inconsistent mapping might put failover in jeopardy.

 

Why Does It Happen?

The initial configuration of a cluster is typically correct. However, routine configuration changes, such as adding a new storage volume or extending the cluster to additional nodes, can gradually result in configuration drift that leaves one or more shared volumes unmapped to some of the nodes.

What Is the Impact?

In the event of a cluster failover to the passive node, data stored on an unmapped volume will not be available, leading to downtime for any application that requires access to a database or files stored on these volumes.

 

How Can It Be Avoided?

There are multiple ways to minimize the risk of such configuration drift:

1.     Documentation: Put in place clear and well-documented procedures for any changes introduced to the cluster configuration.

2.     Training: Conduct periodic training for all involved personnel to review possible availability risks introduced by production environment modifications.

3.     Automation: Implement automated auditing of your high availability environment to ensure passive node configuration is always consistent with active node configuration.
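
As an illustration of the automated audit in item 3, here is a minimal Python sketch. It assumes you have already collected, for each node, the set of SAN volume IDs mapped to it (how you gather that depends on your storage and OS tooling and is not shown). It also assumes every volume in the inventory is meant to be shared, so local disks should be excluded beforehand. The node names and volume IDs are hypothetical.

```python
def find_unmapped_volumes(mappings):
    """Return {node: sorted list of shared volumes the node cannot see}.

    `mappings` maps a node name to the set of volume IDs mapped to it.
    A volume counts as shared if at least one node has it mapped.
    """
    all_volumes = set().union(*mappings.values())
    gaps = {}
    for node, volumes in mappings.items():
        missing = all_volumes - set(volumes)
        if missing:
            gaps[node] = sorted(missing)
    return gaps


# Hypothetical inventory: vol03 was added to the active node only.
mappings = {
    "node-a": {"vol01", "vol02", "vol03"},
    "node-b": {"vol01", "vol02"},
}
print(find_unmapped_volumes(mappings))
```

Any node that shows up in the result would be unable to reach the listed volumes after a failover.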

Learn more about Automated Daily High Availability Testing


The Vulnerability Index Benchmark

December 26, 2011

Based on data gathered from 88 organizations worldwide, the Vulnerability Index Benchmark by Continuity Software provides a first-of-its-kind measurement of downtime and data loss risk for each organization, grouped by industry sector.

Get your complimentary copy of the 2011 Vulnerability Index Benchmark study and find out:

  • Which areas of the IT infrastructure present the greatest HA/DR risks?
  • Which industry sectors exhibit the highest levels of downtime and data loss risks?
  • What types of HA/DR risks are most common in each sector?
  • How can you compare your risk to your industry peers?

Link to download the 2011 Vulnerability Index Benchmark study


The side effects of failover in a cluster

August 19, 2009

by Yaniv Valik
SR DR Specialist, DR Assurance Group

It happens all the time. You’ve decided to manually switch to a different node in the cluster, or maybe your active node crashed. Luckily enough, production services started running on the formerly passive node (well, sometimes they won’t…). Everything is up and running but something has changed, and not for the better… usually it’s performance.

If you’re a database/system/storage administrator or an IT manager, you’re probably all too familiar with this scenario. It doesn’t matter whether you’re using Veritas Cluster, Microsoft Cluster, AIX HACMP, HP-UX MC/ServiceGuard or Sun/Linux Clusters. Finding the root cause, if it can be found at all, could take weeks. Database, server and storage configuration (and everything in between) is so complex in today’s datacenters that there could be thousands of potential causes. Even when you have a “suspect”, testing it may result in additional side effects, or worse – downtime.

Here are a few examples of things that could go wrong and result in performance degradation… but it’s really just the tip of the iceberg:

Example I: Reduced I/O Settings

  • The standby/passive node has fewer I/O paths to SAN volumes, so it can carry less I/O load. There are dozens of possible variations on this theme…
  • The standby has the same number of paths, but they are not distributed across Fibre Channel adapters and ports as evenly as on the active node
  • I/O mode differences – round-robin or another multi-path load balancing algorithm is configured on the active node while the standby is configured for path fail-over only (no load balancing)
  • Different I/O queue depth configured per device and/or per HBA
  • And so on
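
The checks in this list can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each node's settings are given as a dictionary per device, where `paths` holds the name of the HBA behind each path, and `policy` and `queue_depth` are whatever your multi-path software reports. The data layout, device names and values are all hypothetical; collecting them from a real host is not shown.

```python
from collections import Counter


def compare_io_settings(active, passive):
    """Compare per-device I/O settings of two nodes; return a list of findings."""
    findings = []
    for device, a in active.items():
        p = passive.get(device)
        if p is None:
            findings.append(f"{device}: not visible on passive node")
            continue
        if len(p["paths"]) < len(a["paths"]):
            findings.append(
                f"{device}: {len(p['paths'])} paths on passive vs {len(a['paths'])} on active"
            )
        # Count paths per HBA to spot a different spread across adapters.
        if sorted(Counter(p["paths"]).values()) != sorted(Counter(a["paths"]).values()):
            findings.append(f"{device}: paths spread differently across HBAs")
        for key in ("policy", "queue_depth"):
            if p[key] != a[key]:
                findings.append(f"{device}: {key} is {p[key]} on passive vs {a[key]} on active")
    return findings


# Hypothetical data: same path count, but the passive node uses a single HBA.
active = {"lun0": {"paths": ["hba0", "hba0", "hba1", "hba1"],
                   "policy": "round-robin", "queue_depth": 32}}
passive = {"lun0": {"paths": ["hba0", "hba0", "hba0", "hba0"],
                    "policy": "failover", "queue_depth": 16}}
for finding in compare_io_settings(active, passive):
    print(finding)
```

Note that the path count alone would not flag this pair of nodes; the distribution, policy and queue depth checks do.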

Example II: Different Server Configuration

In this category, the list of examples is practically endless… here are a few:

  • The passive node is configured with different performance settings (for example, on Microsoft Windows, processor scheduling is adjusted for best performance of programs on the passive node, but for background services on the active node)
  • The passive node does not have the latest system or application patch, service pack or version installed
  • The passive node does not use network interface load balancing while the active node does

Example III: Uneven Database-related Configuration

  • The standby/passive node is configured with reduced values for critical system parameters affecting database performance – such as shared memory parameters, semaphores, file limits, and so on
  • The standby/passive node has different performance-related database configuration (e.g., maximum number of processes / threads / sessions, memory pool sizes, operation mode, transaction logging settings, software logging settings, and so on)

Can this be avoided?
Do not wait for the next failover event. Why have end users and application teams breathing down your neck? Verify on an ongoing basis that your clusters follow the vendor’s best practices and that all nodes are aligned in terms of software, kernel parameters, operating system settings, limits, configuration files, hardware-related configuration, and so on.

Automation is required. Automated monitoring to identify gaps between the active and passive nodes of a cluster is the only practical solution.
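
To give a rough idea of what such a comparison involves, here is a minimal Python sketch that diffs two flat dictionaries of settings, one per node. It assumes the settings have already been collected into key/value form; the keys and values below are made up for illustration and the collection step is not shown. This is not how RecoverGuard works internally, just the simplest possible form of the idea.

```python
def diff_settings(active, passive):
    """Return {key: (active_value, passive_value)} for every setting that differs.

    A key present on only one node is reported with None for the other side.
    """
    drift = {}
    for key in sorted(set(active) | set(passive)):
        a, p = active.get(key), passive.get(key)
        if a != p:
            drift[key] = (a, p)
    return drift


# Hypothetical snapshots of two cluster nodes.
active = {"kernel.shmmax": 68719476736, "fs.file-max": 6815744, "db_patch": "10.2.0.4"}
passive = {"kernel.shmmax": 4294967296, "fs.file-max": 6815744}
for key, (a, p) in diff_settings(active, passive).items():
    print(f"{key}: active={a} passive={p}")
```

A real audit would also need to know which differences are expected (host names, IP addresses) so they can be excluded from the report.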

A failover is more than just getting everything up and running as fast as possible. Without keeping the same service levels, operations are still damaged and money is lost. RecoverGuard by Continuity Software addresses these challenges by intelligently identifying risks and vulnerabilities which may result in downtime and reduced performance in case of failover.

P.S.
Readers, it would be great if you could share with me your experiences with failover-related troubles.


Great case study by IDC: Bank of Israel Addresses HA/DR Challenges

August 14, 2009

By Yaniv Valik
SR DR Specialist, DR Assurance Group

The recent global outage at PayPal was a high-profile reminder of why it is important to protect your business against unexpected downtime. I thought you’d be interested in learning what the Bank of Israel is doing to avoid a similar fate. Dan Yachin, an analyst with IDC, has just written a great case study that explores how Israel’s central bank overcame the limitations of traditional disaster recovery and high availability testing to ensure its DR readiness and availability. I really encourage you to take a look at it. You can download the case study here: http://www.continuitysoftware.com/IDC-BankOfIsrael


Note about New Software Release

August 3, 2009

By Gil Hecht
CEO

I thought I’d use this space to give you a quick update on the latest version of our RecoverGuard automated disaster recovery/high availability testing and monitoring software. As you know, RecoverGuard automatically scans your IT infrastructure to find hidden configuration errors or data protection gaps before they impact your operations.  Here’s a brief summary of the new functionality in Version 4.3:

  • Support for EMC CLARiiON platform, including MirrorView and SnapView. RecoverGuard also supports EMC Symmetrix, NetApp and HDS USP & AMS.  In an upcoming release we’ll expand this list further by adding support for HP XP and IBM DS.
  • Configuration testing support for all major cluster vendors, enabling you to ensure availability at all times
  • Significant advancements to the Availability Advisor, enabling RecoverGuard to scan for risks in kernel parameters, storage routing, domain and DNS settings, patches and service packs, and much more.

Of course, we continue to add new DR/HA risks to RecoverGuard’s robust gap detection knowledgebase, which currently holds over 3,000 potential known risks. We’ve posted some of the more common gaps on this blog, and will continue to do so as time goes on. You can see all of those we’ve posted by clicking on “Gap Analysis” under “Categories” in the upper right portion of this blog. If you’d like to see even more examples, you can see them on our website at http://www.continuitysoftware.com/commongaps


Webinar: Why Your DR/HA Systems Will Fail….

May 20, 2009

by Doron Pinhas
VP, Field Operations

Last Thursday I had a great webinar discussion with analyst Christine Taylor from The Taneja Group on one of the greatest threats to recoverability and HA – configuration drift. The event was called “Why Your HA/DR Systems Will Fail…and How to Make Sure They Won’t”, and if you couldn’t join us live, the webinar is now available on-demand. Just go to our website (http://www.continuitysoftware.com) and click on the link under Latest Webinars.

When configuration drift occurs – and it is inevitable – your production or primary infrastructure configurations become different from your recovery or secondary infrastructure. This creates serious data protection and host configuration gaps that threaten your ability to achieve your Recovery Point and Recovery Time Objectives.

Christine and I covered a lot of topics during our conversation, including:

  • Why configuration drift is a process problem, not a technology problem
  • Why disaster recovery and availability testing falls short of addressing the issue
  • How automated testing and monitoring solutions from companies like Symantec and Continuity Software are helping companies bullet-proof their DR/HA strategies

In addition, I provided a detailed look at several common recoverability/availability gaps that are created by configuration drift, why they occur, how they will impact operations, and how you can avoid them.

I hope you get a chance to tune in. I think you’ll find it worthwhile.


Gap Analysis #6: Configuration Drift between Production and HA

May 16, 2009

by Yaniv Valik
SR DR Specialist, DR Assurance Group

Here’s a gap that we frequently see in HA environments.

Gap: Configuration Drift between HA Cluster Nodes

Risk: Downtime; manual intervention needed to recover

How does it happen?
While there are many ways this can occur, let’s look at one example: the passive node has no redundancy at the HBA level or in the DNS configuration, while the currently active node is configured with redundancy for both. A single HBA or a single DNS server is a single point of failure. Upon fail-over/switch-over to the currently passive node, the applications running on this cluster will suffer from reduced availability/MTBF and more downtime. In addition, the passive node is configured with a significantly lower limit on open files, which may lead to application failures. Moreover, the passive node has only 1GB of swap while the active node was configured with an additional 4GB. Upon fail-over, the applications may not have sufficient memory to run properly. Lastly, differences in installed products may have various impacts, depending on the product type.
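
The example above can be turned into a handful of simple checks. The Python sketch below is illustrative only: it assumes each node is described by a dictionary with the fields shown, and all names and values (including the swap sizes) are hypothetical. It looks for single points of failure, limits that are lower on the passive node, and products installed on one node only.

```python
def check_passive_node(active, passive, min_redundancy=2):
    """Return findings where the passive node is weaker than the active node."""
    findings = []
    # Single points of failure: fewer than `min_redundancy` HBAs or DNS servers.
    for key in ("hbas", "dns_servers"):
        if len(passive[key]) < min_redundancy:
            findings.append(f"{key}: only {len(passive[key])} configured on passive node")
    # Capacity limits: the passive node should not be set lower than the active one.
    for key in ("max_open_files", "swap_gb"):
        if passive[key] < active[key]:
            findings.append(f"{key}: {passive[key]} on passive vs {active[key]} on active")
    # Products installed on one node only.
    for product in sorted(set(active["products"]) ^ set(passive["products"])):
        findings.append(f"product installed on one node only: {product}")
    return findings


# Hypothetical node descriptions.
active = {"hbas": ["hba0", "hba1"], "dns_servers": ["10.0.0.1", "10.0.0.2"],
          "max_open_files": 65536, "swap_gb": 5, "products": ["db-agent", "backup-agent"]}
passive = {"hbas": ["hba0"], "dns_servers": ["10.0.0.1"],
           "max_open_files": 1024, "swap_gb": 1, "products": ["db-agent"]}
for finding in check_passive_node(active, passive):
    print(finding)
```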

What is the impact?
This will vary depending upon the specific drift, but can include a failure to fail over or switch over to the other node (causing downtime), or reduced performance after fail-over/switch-over which will, at best, create an operations slowdown and, at worst, leave the node unable to carry the load.

Can it happen to me?
This situation occurs frequently in HA environments. The configuration of a host involves so many details that it is very difficult to ensure an HA server is fully synchronized with its production host at all times.

If you’d like to read more gap analyses, go to our website at http://www.continuitysoftware.com/commongaps


Video: “Configuration Drift? No Problem.”

April 17, 2009

If you are interested in learning more about the problems posed by configuration drift, here’s an interesting video from Symantec’s website:

http://www.symantec.com/connect/videos/configuration-drift-no-problem-symantec-introduces-veritas-commandcentral-disaster-recovery-a


Gap Analysis #4: Point-in-time copies never tested

March 19, 2009

by Yaniv Valik
SR DR Specialist, DR Assurance Group

So far in my Gap Analysis series we’ve covered replication inconsistencies, missing networking resources and tampering risk gaps. Today I’m going to take a look at what happens when point-in-time copies are never tested.

Gap: Point-in-time copies never tested

Risk: Data loss and increased time to recover

How does it happen? Point-in-time copies like snapshots and BCVs are the second line of defense against human errors, viruses and outages. The DR configuration for applications typically includes:

  • Multiple local point-in-time copies such as EMC TimeFinder, HDS ShadowImage/Snapshot, NetApp FlexClone/Snapshot, or CLARiiON SnapView;
  • Remote synchronous replication such as EMC SRDF, Hitachi TrueCopy, CLARiiON MirrorView, and NetApp SnapMirror; and
  • Local point-in-time copies on the remote site.

In addition, the copies could be mapped to the target DR servers, configured with multi-path software such as EMC PowerPath, Veritas DMP and MPIO, and defined in logical volumes such as Veritas VxVM.

Point-in-time copies can easily become corrupt without being discovered, unless the application is fully started and the data integrity is thoroughly tested. There are numerous scenarios that can lead to such corruption, such as when the replica devices do not all belong to the same consistency group.

What is the impact? The replica is corrupt and unusable. The file system will need to be recreated at the disaster recovery site and data restored from a recent backup, thereby increasing the time to recovery. All data created since the last backup will be lost. Corrupted file systems may still appear usable in many cases, and only a close inspection of the content can reveal the fact that the data is meaningless.
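
The consistency group scenario is one of the few that can be spotted from configuration data alone. Here is a minimal Python sketch, assuming you can export a list of replica devices with the application each one serves and the consistency group it belongs to. The field names and values are hypothetical, and a clean result does not replace starting the application and testing data integrity.

```python
from collections import defaultdict


def split_consistency_groups(replica_devices):
    """Return {application: sorted group names} for applications whose
    replica devices span more than one consistency group (or none at all)."""
    groups = defaultdict(set)
    for dev in replica_devices:
        groups[dev["application"]].add(dev["group"])
    return {
        app: sorted(str(g) for g in names)
        for app, names in groups.items()
        if len(names) > 1 or None in names
    }


# Hypothetical export: one billing-db device ended up in a different group.
devices = [
    {"device": "r01", "application": "billing-db", "group": "cg1"},
    {"device": "r02", "application": "billing-db", "group": "cg1"},
    {"device": "r03", "application": "billing-db", "group": "cg2"},
    {"device": "r10", "application": "crm", "group": "cg3"},
    {"device": "r11", "application": "crm", "group": "cg3"},
]
print(split_consistency_groups(devices))
```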

Why does the DR test miss this? This gap can be missed if the specific business service is not tested for DR or if the DR test only includes turning on the DR server without actually running the applications.

If this is of interest to you, you can check out some other typical gaps on our website: http://www.continuitysoftware.com/commongaps


New release provides end-to-end solution

March 16, 2009

by Doron Pinhas
VP, Field Operations

All contributors to this blog try very hard not to hype our company’s products and services. Our posts are written to offer insights on a topic and technology we all know very well – DR and HA. However, today I’m going to break from that position just briefly. Continuity Software has announced RecoverGuard 4.0, a new version of its automated HA/DR testing and monitoring solution. I’d like to share it with you because it really does represent a major technological advancement.

With new support for clusters, root cause analysis, high availability gap detection and reporting, RecoverGuard is now a complete, end-to-end solution (from protection of data through availability) for ensuring business continuity. This is significant because, for the first time, it will give you visibility into your HA infrastructure, which has never been possible before.

If you’d like to learn more about this important new release, you can:
Read the press release
Watch the on-demand webinar Solving HA/DR Configuration Drift
Check out what the press and analysts had to say

