The importance of creating point-in-time copies on a regular basis

April 5, 2010

Today I’d like to take the time to outline the differences between continuous replication (often referred to as “synchronous” or “asynchronous”) and periodic replication, or point-in-time (PiT) copies.

When referring to “continuous replication”, I include any type of replication which maintains a target copy synchronized with its source on an on-going basis. This category may include:

  • Synchronous replication – The target is identical to the source at all times (no write is acknowledged on the source until it has also been committed on the target). Examples of synchronous replication are EMC SRDF/S, HDS TrueCopy (TC) and HP Continuous Access Synchronous.
  • Asynchronous replication – As with synchronous replication, the target is continuously synchronized with the source; however, it may lag behind by up to several minutes. Examples of asynchronous replication are EMC SRDF/A, Hitachi Universal Replicator (HUR) and IBM Global Mirror.

 Point-in-time (PiT) copies, on the other hand, are not continuously synchronized. They are updated from their source at specific times, and once they reach full synchronization the process is stopped. Note that this definition covers both copies kept within the same storage frame (such as EMC TimeFinder, HDS ShadowImage, NetApp snapshots, IBM FlashCopy) and copies kept on a different frame (such as NetApp SnapMirror, timed synch-split SRDF, HUR copies, etc.).

 Why do we need a/synchronous replication? Why do we need PiT copies? Do we need both or can we choose only one of the two types? What are the benefits and weak spots of each method?

When it comes to disaster recovery, the different scenarios can be divided into two groups:

  • Physical risks. This group includes hardware failure, outage, natural disaster and so on.
  • Logical risks. This group includes accidental data deletion or corruption by users or applications, software error that harms data integrity, viruses, etc.

 The short answer is:

  • You’d need both continuous replication and PiT copies to ensure successful recovery from both the physical and logical risk scenarios.
  • A/Synchronous replication is mainly aimed at dealing with the “physical risks” group.
  • PiT copies solve the threats associated with the “logical risks” group.

In more detail – when an outage or any kind of physical error occurs, continuous replication will allow you to recover with the least amount of data loss (or with no data loss at all). However, when (for example) a file is deleted accidentally, it is removed from your synchronized copy as well! Thus, recovery with continuous replication alone is doomed to fail. In order to recover, you’ll need a saved copy of the file from a time before its deletion. In other words – a point-in-time copy. The file can be copied from the PiT copy to the fully synchronized target, resulting in an up-to-date and valid copy of the source, and operations can continue. Moreover, when an unknown set of files has been compromised, the organization may choose to recover directly from the PiT copy.
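
To make the difference concrete, here is a minimal toy sketch in Python. It is not how any storage array works internally; the Volume and ReplicatedPair classes, the take_pit_copy helper and the file name are all made up for illustration. It only shows that a continuously synchronized target faithfully repeats a deletion, while a PiT copy taken earlier still holds the file.

```python
import copy


class Volume:
    """Toy volume: a mapping of file name -> contents (illustrative only)."""

    def __init__(self):
        self.files = {}


class ReplicatedPair:
    """Hypothetical continuous replication: every change hits source and target."""

    def __init__(self):
        self.source = Volume()
        self.target = Volume()

    def write(self, name, data):
        self.source.files[name] = data
        self.target.files[name] = data

    def delete(self, name):
        # A logical error (accidental delete) is replicated just like any write.
        self.source.files.pop(name, None)
        self.target.files.pop(name, None)


def take_pit_copy(volume):
    """Freeze the volume contents as they are right now."""
    return copy.deepcopy(volume.files)


pair = ReplicatedPair()
pair.write("orders.db", "contents as of 09:00")

pit_copy = take_pit_copy(pair.source)   # periodic point-in-time copy
pair.delete("orders.db")                # accidental deletion, replicated at once

lost_on_mirror = "orders.db" not in pair.target.files

# Recovery: copy the file back from the PiT copy; replication re-syncs the target.
pair.write("orders.db", pit_copy["orders.db"])
restored = pair.target.files["orders.db"]

print(lost_on_mirror, restored)
```

The synchronized target is of no help for the deleted file; only the PiT copy can bring it back.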

Now I know what you’re thinking – “Hey! I can do the same thing with my tape backups. I don’t need PiT copies”. That is true, but only to some extent. There are many advantages to having PiT copies on top of backup, such as:

  • PiT copies are significantly more available – can be used immediately. Moreover, you can have PiT copies in every site.
  • You can create PiT copies frequently – as often as needed (for example, every couple of hours)
  • Retrieving files from backup is painfully slow (hours to days) – hardly enough to meet the (enterprise level) recovery time objectives (RTO)
  • In addition, there are obvious benefits to having PiT copies and taking backup of them instead of directly backing up production…but that’s a whole different topic.

Any organization with strict RPO and RTO policies would be wise to maintain both continuous and PiT copies. Any other architecture has weak spots that may end in loss of data or prolonged downtime in case of failover.


About virtualization and disaster recovery

March 29, 2010

Virtualization and solutions such as VMware SRM certainly offer great advantages for DR testing. The assurance that the production and DR servers are 100% identical is a most appealing feature of virtualization. Other benefits, such as the ability to run production and DR in parallel, to run a DR exercise whenever you want, and the simplicity of virtualized servers, also bring progress to the field of Disaster Recovery Testing. However, even with virtualization, successful recovery is far from being a slam dunk. In the next paragraphs I’ll try to outline a few of the challenges around recovery in a virtualized environment.

Configuration errors and best practice violations may still render your replicated virtual machines/data corrupt and/or inconsistent, and thus unrecoverable. Very much like in the physical (pre-virtualization) world, if you do not follow the rules and diligently make sure that implementation and day-to-day changes meet the guidelines of the different vendors, your recovery will be at risk. For example, a point-in-time copy (created with EMC TimeFinder, HP StorageWorks XP Business Copy or the like) taken while the source virtual machine was not shut down or suspended is at high risk of being inconsistent. Some of you may recall similar concepts for creating consistent images for databases such as Oracle (cold/hot backup), UDB (I/O suspension) and other DBMS. Of course, this is just one example from a long list of prerequisites, guidelines and recommendations – and each vendor has its own list. On top of that, there are specific cross-vendor guidelines (e.g. NetApp and VMware, Hitachi and Hyper-V, etc.) – but we’ll get to that later.

A DR exercise is still a complex operation. Yes, with tools such as VMware SRM, in theory a DR test is just a few clicks away. In reality, there remain many challenges that prevent frequent DR testing, such as:

Complete Prod/DR separation is difficult and mistakes gravely affect production

  • Network collisions, conflicts, etc.
  • Dependencies on physical elements (a file server or other sensitive un-virtualized application that at the same time interacts with virtualized components)

DR testing is more than just failing over – it’s a complex operation

  • Different teams must verify storage, servers, databases and applications are functioning properly
  • Real life scenario must be simulated including peak load scenarios
  • Workstations must be manned with end users
  • It takes time
    • Problems must be resolved
    • Processes must be coordinated
  • It’s not enough to bring everything online; reasonable performance must also be tested and assured (see example ahead)

Manpower – as implied by the previous bullet – a DR exercise requires dedicated cross-domain human resources

  • BCP personnel, Project managers, IT managers,…
  • Storage administrators
  • Unix and Windows system administrators
  • Network administrators
  • Security personnel
  • Oracle DBAs, MS-SQL DBAs, …
  • Application owners – WebSphere, BEA, Exchange, Lotus,…
  • End users

 

Dependencies and overlap areas between different domains and areas of responsibility create vulnerabilities which jeopardize the ability to recover successfully. Virtual machines depend on correct storage and replication configuration. For example, a reduced RAID level puts your VMs at risk. Furthermore, if a VMFS datastore is only partially replicated to the remote site, or if consistency groups do not include all required resources, data will be lost upon disaster. There are plenty of other examples of what could go wrong in the VMware-storage overlap (is your remote ESX configured with the same multipath level as the production ESX? The same load balancing algorithm? The same queue depth? I could go on and on). Other dependencies exist between databases and VMware: depending on your required level of recovery assurance, you may need to put the database in backup mode (Oracle lingo) while creating VMware or storage snapshots (unless you’re willing to settle for recovery-not-guaranteed crash-consistent copies…). Another overlap area is between virtualized and un-virtualized environments. In the real world, not all assets are virtualized, and those assets interact with virtualized components. Hence, all the complexities of the physical disaster recovery drill still exist (some may say that having to deal with two types of environments creates an even greater challenge). Examples of such virtual-to-non-virtual relationships are:

  • Virtualized client accessing non-virtualized NFS/CIFS file server
  • Database on VM interacts (via DB links for example) with database on a physical server
  • Virtualized business line relies on data from another non-virtualized business line – or vice versa
  • Virtualized clients accessing non-virtualized applications (Exchange, Lotus, etc.)
  • Physical domain / DNS servers serving virtualized environments
  • And so on…

 

To sum up – yes, virtualization is a game changer for Disaster Recovery Management (DRM). Nevertheless, many of the traditional BCP/DR challenges still exist, alongside several new challenges which have emerged as a result of using virtualization. Running a DR exercise is simple in theory but not in practice. To ensure successful recovery, an enterprise organization must invest significant time, money and human resources. Automation is the key. The use of HA/DR monitoring solutions such as Continuity Software’s RecoverGuard and Symantec’s Disaster Recovery Adviser (DRA) can give BCP and IT teams visibility into dependencies in virtualized and physical environments, as well as automatic detection of availability and recovery vulnerabilities.


SLA Management™ Now Available in RecoverGuard 5.0

February 3, 2010

The planning and execution of disaster recovery procedures often involves multiple teams within an organization, including business continuity managers, as well as storage, system/application, and other groups within the IT department.  However, poor communication and collaboration, as well as conflicting objectives, can create a disconnect between these various teams.  As a result, they often over- or under-provision the disaster recovery systems designed to ensure data protection and availability – causing either unacceptable exposure or significant waste of resources when an event occurs.

Our new Service Level Agreement (SLA) Management module, available with RecoverGuard 5.0, can help organizations overcome these challenges, providing them with a robust solution that ensures sufficient protection at all times, while eliminating the wasted money or staff time associated with over-provisioning.  Those responsible for business continuity and data protection will gain greater control over and visibility into the storage resources they dedicate for disaster recovery purposes.

With a broad range of powerful capabilities, our SLA Management solution can enable companies to avoid the problems typical of disaster recovery procedures, such as lack of redundancy, too few copies of critical data, and over-inflated costs.  This empowers them to more effectively and economically optimize storage allocation and utilization to meet application performance goals and requested service levels.

Key features of our SLA Management module include:

  • An intuitive, business-oriented SLA definition builder that makes it easy for even those users with little or no technical savvy to define levels of service, and associate them with services, servers, databases, and other technology assets.  For example, users can define how often remote copies are refreshed, the type of storage to be used, how long local copies will be retained, or the level of redundancy (if any).
  • Comprehensive reporting that allows IT staff to assess existing service levels, and compare them to policies and guidelines.
  • Real-time alerts that immediately notify stakeholders when deviations from SLA rules take place.

 

Visit our Web site to learn more about our new SLA Management module, and how it can help your business achieve maximum data protection, in the most efficient and cost-effective manner possible.


A common risk identified in remote mirroring configuration

January 10, 2010

Remember our LVM mirroring article from a few weeks ago? This time I’d like to take a closer look at one of the potential risks that were described.

The risk signature:

Incorrect Mirror Configuration for DR


The impact

In any event which requires recovering data from the DR site:

  • Recovery will not be possible
  • Data will be lost
  • RPO SLA will be breached
  • Extended downtime and RTO SLA violation

Technical details

In this scenario, the customer is using mirroring at the LVM (Logical Volume Management) level to create a synchronous copy of the database at the DR site.

The source data is stored on SAN volumes located at the production site, whereas the mirror is supposed to be stored on SAN volumes at the DR site. However, the configuration is erroneous, since the mirrored data is partially stored on volumes from the production SAN array (see Image 1: Incorrect Mirror Configuration). In the event of a disaster, no complete copy of the database will be available at the DR site, and recovery from the mirror will not be possible. The database will have to be recovered from a recent backup, a process which involves loss of data and RPO violation and, due to the nature of recovering from tape, prolonged downtime and RTO SLA violation.
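
The check itself is easy to express once you have an inventory of which physical volume lives on which array. The sketch below is a hedged illustration in Python: the PhysicalVolume record, the site labels and the volume and array names are all hypothetical, and in practice the inventory would have to be collected from your LVM and SAN tooling rather than typed in by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicalVolume:
    name: str
    array: str
    site: str  # "PROD" or "DR" (labels assumed for this example)


def misplaced_mirror_volumes(mirror_leg, dr_site="DR"):
    """Return the volumes of the DR mirror that do not reside at the DR site."""
    return [pv for pv in mirror_leg if pv.site != dr_site]


# Hypothetical inventory: the mirror should live entirely on the DR array,
# but one of its volumes was allocated from the production array.
mirror_leg = [
    PhysicalVolume("pv11", "array-dr-1", "DR"),
    PhysicalVolume("pv12", "array-prod-1", "PROD"),
    PhysicalVolume("pv13", "array-dr-1", "DR"),
]

bad = misplaced_mirror_volumes(mirror_leg)
for pv in bad:
    print(f"RISK: mirror volume {pv.name} resides on {pv.array} at site {pv.site}")
```

A single misplaced volume is enough to make the DR copy incomplete, so any non-empty result should be treated as a recovery risk.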

Can it happen to me?

Yes, for various reasons.

First, configuration errors are inevitable in an enterprise datacenter environment, which involves thousands of configuration entities such as arrays, disks, physical volumes, logical volumes and so on…

Moreover, such a vulnerability would go unnoticed until recovery is needed, since the mirrored copy is not put to use on a regular basis. Last, configuration drift accumulates over time. Even if the environment was set up correctly in the past, any change applied may endanger the validity of the DR solution. For instance, expanding the database to new file systems and/or SAN volumes may break the DR mirror if the implementer does not take into account the intentions of the original design and its complexity.

Image 1: Incorrect Mirror Configuration

Think your data centers may have hidden recoverability and downtime risks such as this?  Find out with the risk free 48-hour RecoverGuard pilot scan.


Why Every Disaster Recovery Plan Must Include RPO

December 21, 2009

In our last post, we provided an overview of recovery point objective (RPO), a critical disaster recovery metric that defines the level of data loss a company is willing to tolerate when an outage takes place.  In this entry, we’ll discuss what makes RPO so crucial when it comes to facilitating proactive disaster recovery planning strategies.

Today’s companies run on information.  Corporate data is leveraged each and every day during automated business transactions and in support of strategic planning and decision-making by executives and managers.  And in many cases, data is made available to customers to enhance service and satisfaction, or to external business and supply chain partners.  So when information is lost as the result of a technical failure or system outage, mission-critical operations can be severely affected.

The level of impact an organization will feel when a disaster strikes will depend greatly on the type of information lost and its primary use.  Data can be classified in a variety of ways – there is data needed for revenue generation, data that enhances the customer experience, data that facilitates internal productivity, etc.  Those who are defining recovery plans need to take these data classes into account, and set an appropriate RPO for each. For example, data that is required for sales and revenue generation, such as inventory information that lets customers know if a product is in stock before they place an order over the Web, would need a shorter, more rigid RPO than data utilized in non-critical internal processes.

By taking this approach, companies can better design, budget for, and implement the optimal IT solution to ensure that all RPOs are met.  Additionally, they can more effectively communicate disaster recovery plans to all internal and external stakeholders, to maximize preparedness if and when an outage does occur.

Those companies who don’t set RPOs for the various types of data they maintain – and the systems that house them – might find that their disaster recovery plans fail in the event of an emergency.  Without setting and measuring RPOs, it can be quite difficult to properly define disaster recovery processes, or to clearly articulate how those processes should be carried out.  As a result, organizations may:

  • Face unnecessary risks due to “gaps” in their plan’s coverage.  This can lead to unacceptable data loss, which ultimately translates to lost revenue, lack of regulatory compliance, customer churn, or damage to brand image and reputation.
  • Hinder the efficiency and effectiveness of disaster recovery procedures.  Without clear instructions, IT teams are likely to either over-provision (wasting valuable human and financial resources) or under-provision (leaving important data unprotected from loss) their environments. In our many years of experience helping companies evaluate the technical validity of their disaster recovery infrastructures, this is one of the issues we see most frequently.

And, perhaps most importantly, remember that defining an RPO is not a one-time event.  It is an ongoing process that must be flexible, leaving room for re-evaluation and refinement as business needs, technology environments, and other internal and external factors change.

Visit our Web site to find out more about RPO, its vital role in the disaster recovery process, and RecoverGuard, our robust solution for enabling rapid, accurate RPO measurement.


Using LVM mirroring for disaster recovery?

December 20, 2009

One of the disaster recovery solutions in use by many organizations is LVM mirroring and snapshots. Some of you may raise an eyebrow reading this, thinking “LVM mirroring? Over tens of kilometers?”. The answer is “Yes”. While traditionally LVM mirroring was used locally in order to keep data highly available and to create copies of logical volumes, today more and more organizations choose LVM mirroring as the solution to keep a synchronized copy of the local data at the remote site. So in fact, instead of using storage-based replication technologies, data is copied at the host level. Of course, to ensure high resiliency, reliability, availability and performance, data is still stored on SAN arrays such as EMC DMX, HP XP, IBM DS and so on. I am guessing this approach is becoming more and more popular since LVM mirroring is typically free or part of basic, already-purchased software packages, while remote replication software usually requires a separate, sometimes costly, license.

Unfortunately, LVM mirroring for disaster recovery is no less complex than storage replication. Many of the risks associated with storage replication are still relevant in the LVM mirroring scenario. Moreover, LVM mirroring introduces a few new risks that do not exist in storage replication.

Here are a few examples of configuration errors that are often created over time and lead to data loss and extended recovery time in case of disaster recovery:

#1: Incorrect mirror configuration

The file system is striped and the source data is stored on several SAN volumes on the local SAN array. The DR mirror is stored on several SAN volumes; however, one of these volumes is from a local SAN array. Oh dear, oh dear… in case of disaster, this would result in complete data loss. Data would have to be restored from backup, resulting in RPO and RTO violations. This is a very common risk signature that is detected by RecoverGuard in datacenters implementing LVM mirroring and snapshots.

#2: Missing mirrors

When using synchronous storage replication, replication pairs between the local and the remote array are often pre-configured. Hence, when a new storage volume is allocated and used at the production site, it is already being replicated and protected. With LVM mirroring, this is not the case. The administrator must keep track of any new logical volume and create the mirror every time. Some changes slip through, and the result is un-mirrored, unrecoverable production data. In addition, sometimes the administrator knowingly doesn’t create a mirror for a logical volume because it is currently unused. However, at some point the logical volume is put to use, and the administrator is either not aware of the change or was not notified of it (which reminds me of the post I’ve made regarding IT team coordination… check it out). The outcome is yet again complete data loss upon disaster.
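
A periodic audit along these lines can catch both cases. The following is only a sketch in Python under stated assumptions: the inventory dictionary, its "copies" and "in_use" fields and the volume names are invented for the example, and a real audit would populate them from the volume manager on each host.

```python
def classify_unmirrored(inventory):
    """Split unmirrored logical volumes into in-use (critical) and unused (warning).

    inventory maps a logical volume name to a dict with:
      "copies": number of copies of the data (1 means no mirror)
      "in_use": whether production data currently lives on the volume
    """
    critical, warning = [], []
    for name, info in sorted(inventory.items()):
        if info["copies"] >= 2:
            continue  # mirrored, nothing to report
        if info["in_use"]:
            critical.append(name)
        else:
            warning.append(name)
    return critical, warning


# Hypothetical inventory for one host.
inventory = {
    "lv_data01": {"copies": 2, "in_use": True},
    "lv_data02": {"copies": 1, "in_use": True},    # slipped through
    "lv_spare01": {"copies": 1, "in_use": False},  # unused today, unprotected tomorrow
}

critical, warning = classify_unmirrored(inventory)
print("Unmirrored and in use:", critical)
print("Unmirrored but unused:", warning)
```

Reporting the unused volumes separately is a deliberate choice: they are harmless today, but they are exactly the ones that end up holding production data without anyone revisiting the mirror.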

#3: Not built for a large scale datacenter

One of the greatest obstacles of LVM mirroring and snapshots is that they were never designed to be used on a large scale. As a result, there are no management tools that allow you to enforce policies, act on groups and so on. Several weak spots that fall into this category are:

  • No federated consistency. With storage replication, one may create a disk group that includes many servers and ensures I/O consistency (but not necessarily application level consistency). With LVM mirroring, this is not an option. Consistency is only guaranteed within a server.
  • Difficult to manage PiT copies. Establishing point-in-time copies (snapshots) requires heavy scripting, which in turn requires significant maintenance and care. Moreover, a large number of snapshots may have a grave impact on server performance.

#4: No true async mode

Most LVM software does not offer an asynchronous mirroring solution. The products that do usually rely on opportunistic/dirty mirroring, not committing to any SLA. Moreover, some of these solutions require significantly more storage (cascading async mirroring on top of short-distance sync mirroring, e.g. “bunker”). In today’s world, a good and reliable asynchronous data copy/replication tool is important. Due to performance considerations, sometimes synchronous replication is not an option. Moreover, by nature synchronous mirroring/replication is only applicable over short distances, and LVM mirroring may be even more sensitive to distance, as the source server needs access to the remote SAN array.

Other honorable mentions include site tagging management, complexities in root volume mirroring (boot from SAN) and the ability to mirror incompatible storage tiers (while some people consider this as an advantage, it may lead to performance loss).

Conclusions?

LVM mirroring is not a bad choice for a small datacenter or a specific business service with no strict performance requirements. However, take particular care to monitor and maintain the mirroring and snapshot configuration. Since every logical volume is managed separately, configuration drift is likely to occur, which will lead to loss of data and extended recovery time in the event of a disaster. Change management is important, even in medium size environments and surely in the larger datacenters. Monitoring and risk analysis tools such as RecoverGuard can help you detect configuration errors, such as those mentioned above, as they occur. If you find this interesting, visit our website for additional information.


What is RPO?

November 29, 2009

There are many important elements within any business continuity strategy, but the majority of experts will argue that recovery point objective (RPO) is one of the most vital components.  When developing disaster recovery plans, this important metric, which indicates the level of data loss (measured in time) that a company is willing to accept when disaster strikes, must be included.

More specifically, RPO is the maximum acceptable number of hours of lost data in case of a critical event. For example, if the RPO for an accounting system is four hours, then IT teams will work to bring the application data to the same state it was in no more than four hours before the outage took place.  Any information generated or modified during that time will either be deemed irretrievable, or will need to be re-entered.
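
As a worked illustration of the four-hour example, the short Python sketch below compares the age of the most recent usable copy with the RPO. The function names and the timestamps are assumptions made up for the example; the only idea taken from the text is that the data loss window is the time between the last good copy and the outage.

```python
from datetime import datetime, timedelta


def data_loss_window(last_good_copy, outage_time):
    """Time span of changes that would be lost if we recover from the last copy."""
    return outage_time - last_good_copy


def meets_rpo(last_good_copy, outage_time, rpo):
    """True when the potential data loss is within the recovery point objective."""
    return data_loss_window(last_good_copy, outage_time) <= rpo


rpo = timedelta(hours=4)                 # RPO of the accounting system
outage = datetime(2009, 11, 29, 14, 0)   # hypothetical outage time

copy_a = datetime(2009, 11, 29, 11, 0)   # copy taken 3 hours before the outage
copy_b = datetime(2009, 11, 29, 9, 0)    # copy taken 5 hours before the outage

print(meets_rpo(copy_a, outage, rpo))    # 3 hours of loss is within the RPO
print(meets_rpo(copy_b, outage, rpo))    # 5 hours of loss breaches the RPO
```

In other words, to honor a four-hour RPO, a valid copy must never be more than four hours old at any moment an outage could occur.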

In a recent teleconference on benchmarking disaster recovery management readiness, leading industry analyst firm Gartner indicated that the definition, documentation, and updating of RPO requirements for production applications were needed steps “in order to improve disaster recovery predictability, effectiveness, and efficiency”.

RPO should be based on many factors, including the nature and importance of the business process and related systems impacted.  For example, a company may set a stringent (low) RPO for customer relationship management (CRM) applications that facilitate mission-critical sales and service activities, and a less demanding (higher) one for less crucial applications, like inventory management.  Other factors are the human resources required to support recovery efforts, and the IT budgets available to cover associated costs.

While an RPO of zero hours (meaning, no lost data) may sound ideal, for most businesses that goal is both unrealistic and cost-prohibitive.  And, in many cases, particularly for systems that process a low volume of transactions or support non-critical activities, an RPO of zero is simply unnecessary.  The goal of RPO should be to balance cost with protection level.  Once RPO is determined, IT departments can then implement the appropriate protection measures, such as setting up backups, snapshots, and replication based on the RPO for each system.

Visit our Web site to find out more about RPO, and to learn about RecoverGuard, our robust solution that enables precise RPO measurement.


The side effects of failover in a cluster

August 19, 2009

by Yaniv Valik
SR DR Specialist, DR Assurance Group

It happens all the time. You’ve decided to manually switch to a different node in the cluster, or maybe your active node crashed. Luckily enough, production services started running on the formerly passive node (well, sometimes they won’t…). Everything is up and running but something has changed and not for the better… usually it’s performance.

If you’re a database/system/storage administrator or an IT manager, you’re probably all too familiar with this scenario. It doesn’t matter whether you’re using Veritas Cluster, Microsoft Cluster, AIX HACMP, HP-UX MC/ServiceGuard or Sun/Linux Clusters. Finding the root cause, if it can be found at all, could take weeks on end. Database, server and storage configuration (and everything in between) is so complex in today’s datacenters that there could be thousands of potential causes. Even when you have a “suspect”, testing it may result in additional side effects, or worse – downtime.

Here are a few examples of things that could-go-wrong resulting in performance degradation…but it’s really just the tip of the iceberg:

Example I: Reduced I/O Settings

  • The standby/passive node has fewer I/O paths to SAN volumes, thus the passive node can carry less I/O load. There are dozens of possible variations to this theme…
  • The standby does have the same number of paths, but they are not distributed across Fibre Channel adapters and ports as evenly as on the active node
  • I/O mode differences – round-robin or other multi-path load balancing algorithm is configured on the active node while the standby is configured for path  fail-over only (no load balancing)
  • Different I/O queue depth configured per device and/or per HBA
  • And so on
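
The gaps in this list can be checked mechanically once the I/O settings of both nodes have been collected. The following is a rough sketch only: the `NodeIoConfig` structure, its field names and the sample values are assumptions made for illustration, and gathering the real values is platform-specific and not shown.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class NodeIoConfig:
    """Hypothetical summary of one node's I/O settings for a SAN volume."""
    paths_by_hba: Dict[str, int]  # HBA name -> number of paths through it
    multipath_policy: str         # e.g. "round-robin" or "failover"
    queue_depth: int


def io_gaps(active: NodeIoConfig, passive: NodeIoConfig) -> List[str]:
    """List the differences that could reduce I/O capacity after failover."""
    gaps = []
    a_total = sum(active.paths_by_hba.values())
    p_total = sum(passive.paths_by_hba.values())
    if p_total < a_total:
        gaps.append(f"fewer paths on passive node: {p_total} vs {a_total}")
    elif p_total == a_total and (
        sorted(passive.paths_by_hba.values()) != sorted(active.paths_by_hba.values())
    ):
        gaps.append("same path count, but spread differently across HBAs")
    if passive.multipath_policy != active.multipath_policy:
        gaps.append(
            f"multipath policy differs: {passive.multipath_policy} vs {active.multipath_policy}"
        )
    if passive.queue_depth != active.queue_depth:
        gaps.append(f"queue depth differs: {passive.queue_depth} vs {active.queue_depth}")
    return gaps


# Illustrative values only.
active = NodeIoConfig({"hba0": 2, "hba1": 2}, "round-robin", 32)
passive = NodeIoConfig({"hba0": 4}, "failover", 16)

for gap in io_gaps(active, passive):
    print(gap)
```

Here the passive node has the same four paths, but all of them run through a single adapter, and both the multipath policy and the queue depth differ, so three gaps are reported.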

Example II: Different Server Configuration

In this category, there is really an endless list of samples… here are a few:

  • The passive node is configured with different performance settings (for example, on Microsoft Windows, processor scheduling is adjusted for best performance of programs on the passive node, but for background services on the active node)
  • The passive node does not have the latest system or application patch, service pack or version installed
  • The passive node does not use network interfaces load balancing while the active node does

Example III: Uneven Database-related Configuration

  • The standby/passive node is configured with reduced values for critical system parameters affecting database performance – such as shared memory parameters, semaphores, file limits, and so on
  • The standby/passive node has different performance-related database configuration (e.g., max number of processes / threads / sessions, memory pools sizes, operation mode, transaction logging settings, software logging settings, and so on)

Can this be avoided?
Do not wait for the next failover event. Why have end users and application teams breathing down your neck? Verify on an ongoing basis that your clusters follow the vendor’s best practices and that all nodes are aligned in terms of software, kernel parameters, operating system settings, limits, configuration files, hardware-related configuration,… and the list goes on.

Automation is required. Automated  monitoring to identify gaps between cluster active and passive nodes is the only practical solution.
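
As a very rough illustration of what such a comparison involves, the sketch below diffs two dictionaries of settings collected from the active and passive nodes. It is a simplified stand-in, not a description of how any particular product works; the parameter names are common Linux examples and the values are made up for illustration.

```python
from typing import Any, Dict, Tuple


def config_drift(active: Dict[str, Any], passive: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {setting: (active_value, passive_value)} for every mismatch.

    A setting missing on one node is reported with None on that side.
    """
    drift = {}
    for key in sorted(set(active) | set(passive)):
        a_val, p_val = active.get(key), passive.get(key)
        if a_val != p_val:
            drift[key] = (a_val, p_val)
    return drift


# Illustrative values only; a real collector would read these from each node.
active_node = {
    "kernel.shmmax": 68719476736,
    "fs.file-max": 6815744,
    "os_patch_level": "SP2",
}
passive_node = {
    "kernel.shmmax": 4294967296,  # reduced shared memory limit
    "fs.file-max": 6815744,
    # os_patch_level was never recorded for this node
}

for name, (a_val, p_val) in config_drift(active_node, passive_node).items():
    print(f"{name}: active={a_val} passive={p_val}")
```

Running a comparison like this on a schedule, rather than after a failover has already gone wrong, is the point: each reported mismatch is a candidate cause that can be reviewed while the active node is still healthy.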

A failover is more than just getting everything up and running as fast as possible. If service levels are not maintained, operations still suffer and money is lost. RecoverGuard by Continuity Software addresses these challenges by intelligently identifying risks and vulnerabilities which may result in downtime and reduced performance in case of failover.

P.S.
Readers, it would be great if you could share your experience with failover-related troubles with me.


Great case study by IDC: Bank of Israel Addresses HA/DR Challenges

August 14, 2009

By Yaniv Valik
SR DR Specialist, DR Assurance Group

The recent global outage at PayPal was a high-profile reminder of why it is important to protect your business against unexpected downtime. I thought you’d be interested in learning what the Bank of Israel is doing to avoid a similar fate. Dan Yachin, an analyst with IDC, has just written a great case study that explores how Israel’s central bank overcame the limitations of traditional disaster recovery and high availability testing to ensure its DR readiness and availability. I really encourage you to take a look at it. You can download the case study here: http://www.continuitysoftware.com/IDC-BankOfIsrael


Note about New Software Release

August 3, 2009

By Gil Hecht
CEO

I thought I’d use this space to give you a quick update on the latest version of our RecoverGuard automated disaster recovery/high availability testing and monitoring software. As you know, RecoverGuard automatically scans your IT infrastructure to find hidden configuration errors or data protection gaps before they impact your operations.  Here’s a brief summary of the new functionality in Version 4.3:

  • Support for EMC CLARiiON platform, including MirrorView and SnapView. RecoverGuard also supports EMC Symmetrix, NetApp and HDS USP & AMS.  In an upcoming release we’ll expand this list further by adding support for HP XP and IBM DS.
  • Configuration testing support for all major cluster vendors, enabling you to ensure availability at all times
  • Significant advancements to the Availability Advisor, enabling RecoverGuard to scan for risks in kernel parameters, storage routing, domain and DNS settings, patches and service packs, and much more.

Of course, we continue to add new DR/HA risks to RecoverGuard’s robust gap detection knowledgebase, which currently holds over 3,000 potential known risks. We’ve posted some of the more common gaps on this blog, and will continue to do so as time goes on. You can see all of those we’ve posted by clicking on “Gap Analysis” under “Categories” in the upper right portion of this blog. If you’d like to see even more examples, you can see them on our website at http://www.continuitysoftware.com/commongaps

