SLA Management™ Now Available in RecoverGuard 5.0

February 3, 2010

The planning and execution of disaster recovery procedures often involves multiple teams within an organization, including business continuity managers, as well as storage, system/application, and other groups within the IT department.  However, poor communication and collaboration, as well as conflicting objectives, can create a disconnect between these various teams.  As a result, they often over- or under-provision the disaster recovery systems designed to ensure data protection and availability – causing either unacceptable exposure or significant waste of resources when an event occurs.

Our new Service Level Agreement (SLA) Management module, available with RecoverGuard 5.0, can help organizations overcome these challenges, providing them with a robust solution that ensures sufficient protection at all times, while eliminating the wasted money or staff time associated with over-provisioning.  Those responsible for business continuity and data protection will gain greater control over and visibility into the storage resources they dedicate for disaster recovery purposes.

With a broad range of powerful capabilities, our SLA Management solution can enable companies to avoid the problems typical of disaster recovery procedures, such as lack of redundancy, too few copies of critical data, and over-inflated costs.  This empowers them to more effectively and economically optimize storage allocation and utilization to meet application performance goals and requested service levels.

Key features of our SLA Management module include:

  • An intuitive, business-oriented SLA definition builder that makes it easy for even those users with little or no technical savvy to define levels of service, and associate them with services, servers, databases, and other technology assets.  For example, users can define how often remote copies are refreshed, the type of storage to be used, how long local copies will be retained, or the level of redundancy (if any).
  • Comprehensive reporting that allows IT staff to assess existing service levels, and compare them to policies and guidelines.
  • Real-time alerts that immediately notify stakeholders when deviations from SLA rules take place.
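
To give a feel for what a service level definition and a deviation check might look like, here is a minimal Python sketch. It is not RecoverGuard's actual format or API; the class names, field names and sample values are all hypothetical, and the storage type attribute mentioned above is left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceLevel:
    """Hypothetical service level definition (illustrative fields only)."""
    name: str
    remote_refresh_hours: float  # maximum interval between remote copy refreshes
    local_retention_days: int    # minimum retention of local copies
    min_redundant_copies: int    # 0 means no redundancy is required


@dataclass(frozen=True)
class ObservedProtection:
    """What is actually configured for one asset (hypothetical sample data)."""
    asset: str
    remote_refresh_hours: float
    local_retention_days: int
    redundant_copies: int


def sla_deviations(level, observed):
    """List the ways one asset deviates from its service level."""
    issues = []
    if observed.remote_refresh_hours > level.remote_refresh_hours:
        issues.append("remote copy refreshed too rarely")
    if observed.local_retention_days < level.local_retention_days:
        issues.append("local copies retained too briefly")
    if observed.redundant_copies < level.min_redundant_copies:
        issues.append("too few redundant copies")
    return issues


gold = ServiceLevel("gold", remote_refresh_hours=1,
                    local_retention_days=14, min_redundant_copies=2)
billing_db = ObservedProtection("billing-db", remote_refresh_hours=4,
                                local_retention_days=30, redundant_copies=2)
print(sla_deviations(gold, billing_db))
```

An empty list means the asset meets its service level; anything else is the kind of deviation an alert would report.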

 

Visit our Web site to learn more about our new SLA Management module, and how it can help your business achieve maximum data protection, in the most efficient and cost-effective manner possible.


A common risk identified in remote mirroring configuration

January 10, 2010

Remember our LVM mirroring article from a few weeks ago? This time I’d like to take a closer look at one of the potential risks that were described.

The risk signature:

Incorrect Mirror Configuration for DR


The impact

In any event which requires recovering data from the DR site:

  • Recovery will not be possible
  • Data will be lost
  • RPO SLA will be breached
  • Extended downtime and RTO SLA violation

Technical details

In this scenario, the customer is using mirroring at the LVM (Logical Volume Management) level to create a synchronous copy of the database at the DR site.

The source data is stored on SAN volumes located at the production site, while the mirror is supposed to be stored on SAN volumes at the DR site. However, the configuration is erroneous: part of the mirrored data is stored on volumes from the production SAN array (see Image 1: Incorrect Mirror Configuration). In the event of a disaster, no complete copy of the database will be available at the DR site, so recovery from the mirror will not be possible. The database will have to be recovered from a recent backup, a process that involves loss of data and an RPO violation and, due to the nature of recovering from tape, prolonged downtime and an RTO SLA violation.
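
The check behind this risk signature can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how RecoverGuard implements it: the volume names, array names and site labels are hypothetical, and in practice they would come from LVM and SAN inventory data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicalVolume:
    name: str
    array: str  # hypothetical array identifier
    site: str   # "production" or "dr"


def misplaced_mirror_volumes(mirror_leg, dr_site="dr"):
    """Return the volumes of a DR mirror leg that are not at the DR site."""
    return [pv for pv in mirror_leg if pv.site != dr_site]


# Hypothetical layout: the DR mirror leg should live entirely on the DR array.
dr_leg = [
    PhysicalVolume("pv_dr_01", "dr-array-1", "dr"),
    PhysicalVolume("pv_dr_02", "dr-array-1", "dr"),
    PhysicalVolume("pv_prod_07", "prod-array-1", "production"),  # the error
]

for pv in misplaced_mirror_volumes(dr_leg):
    print(f"RISK: mirror volume {pv.name} is on {pv.array} at the {pv.site} site")
```

A single misplaced volume is enough to leave the DR copy incomplete, so any non-empty result should be treated as a recoverability risk.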

Can it happen to me?

Yes, for various reasons.

First, configuration errors are inevitable in enterprise datacenter environments, which involve thousands of configuration entities such as arrays, disks, physical volumes, logical volumes and so on.

Moreover, such a vulnerability would go unnoticed until recovery is needed, since the mirrored copy is not put to use on a regular basis. Last, configuration drift accumulates over time. Even if the environment was set up correctly in the past, any change applied may endanger the validity of the DR solution. For instance, expanding the database to new file systems and/or SAN volumes may break the DR mirror if the implementer does not take into account the intentions of the original design and its complexity.

Image 1: Incorrect Mirror Configuration

Think your data centers may have hidden recoverability and downtime risks such as this?  Find out with the risk free 48-hour RecoverGuard pilot scan.


Why Every Disaster Recovery Plan Must Include RPO

December 21, 2009

In our last post, we provided an overview of recovery point objective (RPO), a critical disaster recovery metric that defines the level of data loss a company is willing to tolerate when an outage takes place.  In this entry, we’ll discuss what makes RPO so crucial when it comes to facilitating proactive disaster recovery planning strategies.

Today’s companies run on information.  Corporate data is leveraged each and every day during automated business transactions and in support of strategic planning and decision-making by executives and managers.  And in many cases, data is made available to customers to enhance service and satisfaction, or to external business and supply chain partners.  So when information is lost as the result of a technical failure or system outage, mission-critical operations can be severely affected.

The level of impact an organization will feel when a disaster strikes will depend greatly on the type of information lost and its primary use.  Data can be classified in a variety of ways – there is data needed for revenue generation, data that enhances the customer experience, data that facilitates internal productivity, etc.  Those who are defining recovery plans need to take these data classes into account, and set an appropriate RPO for each. For example, data that is required for sales and revenue generation, such as inventory information that lets customers know if a product is in stock before they place an order over the Web, would need a shorter, more rigid RPO than data utilized in non-critical internal processes.

By taking this approach, companies can better design, budget for, and implement the optimal IT solution to ensure that all RPOs are met.  Additionally, they can more effectively communicate disaster recovery plans to all internal and external stakeholders, to maximize preparedness if and when an outage does occur.

Those companies who don’t set RPOs for the various types of data they maintain – and the systems that house them – might find that their disaster recovery plans fail in the event of an emergency.  Without setting and measuring RPOs, it can be quite difficult to properly define disaster recovery processes, or to clearly articulate how those processes should be carried out.  As a result, organizations may:

  • Face unnecessary risks due to “gaps” in their plan’s coverage.  This can lead to unacceptable data loss, which ultimately translates to lost revenue, lack of regulatory compliance, customer churn, or damage to brand image and reputation.
  • Hinder the efficiency and effectiveness of disaster recovery procedures.  Without clear instructions, IT teams are likely to either over-provision (wasting valuable human and financial resources) or under-provision (leaving important data unprotected from loss) their environments. In our many years of experience helping companies evaluate the technical validity of their disaster recovery infrastructures, this is one of the issues we see most frequently.

And, perhaps most importantly, remember that defining an RPO is not a one-time event.  It is an ongoing process that must be flexible, leaving room for re-evaluation and refinement as business needs, technology environments, and other internal and external factors change.

Visit our Web site to find out more about RPO, its vital role in the disaster recovery process, and RecoverGuard, our robust solution for enabling rapid, accurate RPO measurement.


Using LVM mirroring for disaster recovery?

December 20, 2009

One of the disaster recovery solutions in use by many organizations is LVM mirroring and snapshots. Some of you may raise an eyebrow reading this, thinking “LVM mirroring? Over tens of kilometers?” The answer is “Yes”. While traditionally LVM mirroring was used locally in order to keep data highly available and to create copies of logical volumes, today more and more organizations choose LVM mirroring as the solution to keep a synchronized copy of the local data at the remote site. So in fact, instead of using storage-based replication technologies, data is copied at the host level. Of course, to ensure high resiliency, reliability, availability and performance, data is still stored on SAN arrays such as EMC DMX, HP XP, IBM DS and so on. I am guessing this approach is becoming more and more popular since LVM mirroring is typically free or part of basic, already-purchased software packages, while remote replication software usually requires a separate, sometimes costly, license.

Unfortunately, LVM mirroring for disaster recovery is no less complex than storage replication. Many of the risks associated with storage replication are still relevant in the LVM mirroring scenario. Moreover, LVM mirroring introduces a few new risks that do not exist in storage replication.

Here are a few examples of configuration errors that are often created over time and lead to data loss and extended recovery time in case of disaster recovery:

#1: Incorrect mirror configuration

The file system is striped and the source data is stored on several SAN volumes on the local SAN array. The DR mirror is also stored on several SAN volumes; however, one of these volumes is from a local SAN array. Oh dear, oh dear… in case of disaster, this would result in complete data loss. Data would have to be restored from backup, resulting in RPO and RTO violations. This is a very common risk signature that is detected by RecoverGuard in datacenters implementing LVM mirroring and snapshots.

#2: Missing mirrors

When using synchronous storage replication, replication pairs between the local and the remote array are often pre-configured. Hence, when a new storage volume is allocated and used on the production site, it is already being replicated and protected. With LVM mirroring, this is not the case. The administrator must keep track of any new logical volume and create the mirror every time. Some changes slip through and the result is un-mirrored, unrecoverable production data. In addition, sometimes the administrator knowingly doesn’t create a mirror for a logical volume because it is currently unused. However, at a certain point the logical volume is put to use, and the administrator is either not aware of that or was not notified of the change (which reminds me of the post I’ve made regarding IT team coordination… check it out). The outcome is yet again complete data loss upon disaster.

#3: Not built for a large-scale datacenter
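
A check for this gap can be very simple once the inventory is available. Below is a minimal Python sketch; the inventory dictionary and the volume names are hypothetical, and how the "in use" flag and the copy count are collected depends on the LVM product in question.

```python
def unmirrored_volumes(inventory):
    """Return names of logical volumes that are in use but have no mirror copy.

    inventory maps a logical volume name to (in_use, number_of_mirror_copies).
    """
    return sorted(
        name
        for name, (in_use, copies) in inventory.items()
        if in_use and copies < 1
    )


# Hypothetical inventory of one production server.
inventory = {
    "lv_data01": (True, 1),
    "lv_data02": (True, 0),   # new volume, mirror never created
    "lv_spare": (False, 0),   # unused today; becomes a risk once put to use
}
print(unmirrored_volumes(inventory))
```

Running such a check on a schedule, rather than once, is what catches the unused volume on the day it quietly goes into production.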

One of the greatest obstacles of LVM mirroring and snapshots is that it was never designed to be used on a large scale. As a result, there are no management tools that will allow you to enforce policies, act on groups and so on. Several sample weak spots that can be included in this category are:
  • No federated consistency. With storage replication, one may create a disk group that will include many servers and will ensure I/O consistency (but not necessarily application-level consistency). With LVM mirroring, this is not an option. Consistency is only guaranteed within a server.
  • Difficult to manage PiT copies. Establishing point-in-time copies (snapshots) requires heavy scripting, which in turn requires significant maintenance and care. Moreover, a large number of snapshots may have a grave impact on server performance.

#4: No true async mode

Most LVM software does not offer an asynchronous mirroring solution. Products that do usually rely on opportunistic/dirty mirroring, not committing to any SLA. Moreover, some of these solutions require significantly more storage (cascading async mirroring on top of short-distance sync mirroring, e.g. “bunker”). In today’s world, a good and reliable asynchronous data copy/replication tool is important. Due to performance considerations, sometimes synchronous replication is not an option. Moreover, by nature synchronous mirroring/replication is only applicable over a short distance. However, LVM mirroring may be more sensitive over longer distances, as the source server needs access to the remote SAN array.

Other honorable mentions include site tagging management, complexities in root volume mirroring (boot from SAN) and the ability to mirror incompatible storage tiers (while some people consider this as an advantage, it may lead to performance loss).

Conclusions?

LVM mirroring is not a bad choice for a small datacenter or a specific business service with no strict performance requirements. However, take particular care to monitor and maintain the mirroring and snapshot configuration. Since every logical volume is managed separately, configuration drift is likely to occur, which will lead to loss of data and extended recovery time in the event of a disaster. Change management is important, even in medium-sized environments and surely in the larger datacenters. Monitoring and risk analysis tools such as RecoverGuard can help you detect configuration errors, such as those mentioned above, as they occur. If you find this interesting, visit our website for additional information.


What is RPO?

November 29, 2009

There are many important elements within any business continuity strategy, but the majority of experts will argue that recovery point objective (RPO) is one of the most vital components.  When developing disaster recovery plans, this important metric, which indicates the level of data loss (measured in time) that a company is willing to accept when disaster strikes, must be included.

More specifically, RPO is the maximum acceptable number of hours of lost data in case of a critical event. For example, if the RPO for an accounting system is four hours, then IT teams will work to bring the application data to the same state it was in no more than four hours before the outage took place.  Any information generated or modified during that time will either be deemed irretrievable, or will need to be re-entered.
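
The arithmetic behind this definition is worth spelling out. Here is a minimal Python sketch using the four-hour RPO from the example above; the outage time and the copy times are made-up values for illustration.

```python
from datetime import datetime, timedelta


def data_loss_window(outage_time, last_recoverable_copy):
    """Time between the newest recoverable copy and the outage."""
    return outage_time - last_recoverable_copy


def meets_rpo(outage_time, last_recoverable_copy, rpo):
    """True if the data loss window does not exceed the RPO."""
    return data_loss_window(outage_time, last_recoverable_copy) <= rpo


rpo = timedelta(hours=4)
outage = datetime(2009, 11, 29, 14, 0)  # hypothetical outage time

# Newest copy is 2.5 hours old: within the four-hour RPO.
print(meets_rpo(outage, datetime(2009, 11, 29, 11, 30), rpo))
# Newest copy is 6 hours old: the RPO is breached.
print(meets_rpo(outage, datetime(2009, 11, 29, 8, 0), rpo))
```

The point of the sketch is that the RPO is measured against the age of the newest copy that can actually be recovered, not against the schedule on which copies are supposed to be taken.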

In a recent teleconference on benchmarking disaster recovery management readiness, leading industry analyst firm Gartner indicated that the definition, documentation, and updating of RPO requirements for production applications were needed steps “in order to improve disaster recovery predictability, effectiveness, and efficiency”.

RPO should be based on many factors, including the nature and importance of the business process and related systems impacted.  For example, a company may set a stringent (low) RPO for customer relationship management (CRM) applications that facilitate mission-critical sales and service activities, and a less demanding (higher) one for less crucial applications, like inventory management.  Other factors are the human resources required to support recovery efforts, and the IT budgets available to cover associated costs.

While an RPO of zero hours (meaning, no lost data) may sound ideal, for most businesses that goal is both unrealistic and cost-prohibitive.  And, in many cases, particularly for systems that process a low volume of transactions or support non-critical activities, an RPO of zero is simply unnecessary.  The goal of RPO should be to balance cost with protection level.  Once RPO is determined, IT departments can then implement the appropriate protection measures, such as setting up backups, snapshots, and replication based on the RPO for each system.

Visit our Web site to find out more about RPO, and to learn about RecoverGuard, our robust solution that enables precise RPO measurement.


The side effects of failover in a cluster

August 19, 2009

by Yaniv Valik
SR DR Specialist, DR Assurance Group

It happens all the time. You’ve decided to manually switch to a different node in the cluster, or maybe your active node crashed. Luckily enough, production services started running on the formerly passive node (well, sometimes they won’t…). Everything is up and running but something has changed and not for the better… usually it’s performance.

If you’re a database/system/storage administrator or an IT manager, you’re probably all too familiar with this scenario.  It doesn’t matter whether you’re using Veritas Cluster, Microsoft Cluster, AIX HACMP, HP-UX MC/ServiceGuard or Sun/Linux Clusters. Finding the root cause, if it is found at all, could take weeks on end. Database, server and storage configuration (and everything in between) is so complex in today’s datacenters that there could be thousands of potential causes.  Even when you have a “suspect”, testing it may result in additional side effects, or worse: downtime.

Here are a few examples of things that could go wrong and result in performance degradation… but it’s really just the tip of the iceberg:

Example I: Reduced I/O Settings

  • The standby/passive node has fewer I/O paths to SAN volumes, thus the passive node can carry less I/O load. There are dozens of possible variations on this theme…
  • The standby does have the same number of paths but they are not distributed across Fibre Channel adapters and ports as evenly as on the active node
  • I/O mode differences – round-robin or another multi-path load balancing algorithm is configured on the active node while the standby is configured for path fail-over only (no load balancing)
  • Different I/O queue depth configured per device and/or per HBA
  • And so on

Example II: Different Server Configuration

In this category, there is really an endless list of samples… here are a few:

  • The passive node is configured with different performance settings (for example, on Microsoft Windows, processor scheduling is adjusted for best performance of programs on the passive node, but for background services on the active node)
  • The passive node is not installed with the latest system or application patch, service pack or version
  • The passive node does not use network interfaces load balancing while the active node does

Example III: Uneven Database-related Configuration

  • The standby/passive node is configured with reduced values for critical system parameters affecting database performance – such as shared memory parameters, semaphores, file limits, and so on
  • The standby/passive node has different performance-related database configuration (e.g., max number of processes / threads / sessions, memory pools sizes, operation mode, transaction logging settings, software logging settings, and so on)

Can this be avoided?
Do not wait for the next failover event. Why have end users and application teams breathing down your neck? Verify on an ongoing basis that your clusters follow the vendor’s best practices and that all nodes are aligned in terms of software, kernel parameters, operating system settings, limits, configuration files, hardware-related configuration,… and the list goes on.

Automation is required. Automated monitoring to identify gaps between cluster active and passive nodes is the only practical solution.

A failover is more than just getting everything up and running as fast as possible. Without keeping the same service levels, operations are still damaged and money is lost.  RecoverGuard by Continuity Software addresses these challenges by intelligently identifying risks and vulnerabilities which may result in downtime and reduced performance in case of failover.
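
At its core, such monitoring boils down to comparing the settings of the two nodes. Here is a minimal Python sketch of that comparison; the setting names and values are hypothetical, and collecting them from real nodes (multipath configuration, kernel parameters, patch levels and so on) is the hard part that is not shown here.

```python
def config_gaps(active, passive):
    """Return {setting: (active_value, passive_value)} for every differing setting.

    A setting that is missing on one node is reported with None for that node.
    """
    gaps = {}
    for key in sorted(set(active) | set(passive)):
        a, p = active.get(key), passive.get(key)
        if a != p:
            gaps[key] = (a, p)
    return gaps


# Hypothetical settings gathered from the two cluster nodes.
active_node = {
    "san_paths": 4,
    "multipath_policy": "round-robin",
    "hba_queue_depth": 32,
    "os_patch_level": "SP2",
}
passive_node = {
    "san_paths": 2,
    "multipath_policy": "failover-only",
    "hba_queue_depth": 32,
}

for setting, (a, p) in config_gaps(active_node, passive_node).items():
    print(f"{setting}: active={a!r} passive={p!r}")
```

Not every difference is a problem, so in practice the output would be filtered against a list of settings that are expected to differ between nodes.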

P.S.
Readers, it would be great if you can share with me your experience with failover-related troubles.


Great case study by IDC: Bank of Israel Addresses HA/DR Challenges

August 14, 2009

By Yaniv Valik
SR DR Specialist, DR Assurance Group

The recent global PayPal outage was a high-profile reminder of why it is important to protect your business against unexpected downtime. I thought you’d be interested in learning what the Bank of Israel is doing to avoid a similar fate. Dan Yachin, an analyst with IDC, has just written a great case study that explores how Israel’s central bank overcame the limitations of traditional disaster recovery and high availability testing to ensure its DR readiness and availability. I really encourage you to take a look at it. You can download the case study here: http://www.continuitysoftware.com/IDC-BankOfIsrael


Note about New Software Release

August 3, 2009

By Gil Hecht
CEO

I thought I’d use this space to give you a quick update on the latest version of our RecoverGuard automated disaster recovery/high availability testing and monitoring software. As you know, RecoverGuard automatically scans your IT infrastructure to find hidden configuration errors or data protection gaps before they impact your operations.  Here’s a brief summary of the new functionality in Version 4.3:

  • Support for EMC CLARiiON platform, including MirrorView and SnapView. RecoverGuard also supports EMC Symmetrix, NetApp and HDS USP & AMS.  In an upcoming release we’ll expand this list further by adding support for HP XP and IBM DS.
  • Configuration testing support for all major cluster vendors, enabling you to ensure availability at all times
  • Significant advancements to the Availability Advisor, enabling RecoverGuard to scan for risks in kernel parameters, storage routing, domain and DNS settings, patches and service packs, and much more.

Of course, we continue to add new DR/HA risks to RecoverGuard’s robust gap detection knowledgebase, which currently holds over 3,000 potential known risks. We’ve posted some of the more common gaps on this blog, and will continue to do so as time goes on. You can see all of those we’ve posted by clicking on “Gap Analysis” under “Categories” in the upper right portion of this blog. If you’d like to see even more examples, you can see them on our website at http://www.continuitysoftware.com/commongaps


The Importance of IT Team Coordination in the Real World – DR Lessons Learned

June 10, 2009

by Yaniv Valik
SR DR Specialist, DR Assurance Group

This time I’d like to share with you a recent incident from one of our new customers. Before I dive into the details, let me briefly describe the environment. The data centers rely on HDS USP storage with mixed Sun Solaris, HP-UX, Linux and Windows servers. Ten percent of the environment is virtualized (VMware ESX). On the database side, Oracle and SQL Server are in use.

This is a 24/7 environment in which downtime is a disaster. The company considers high availability and disaster recovery as mission critical. The IT staff is highly aware of change management in general and specifically how changes in production may impact high availability and readiness for disaster.

As for replication, the following process is implemented:
- Local ShadowImage copies are created twice a day and backed up to tape by Symantec NetBackup.
- Data is replicated synchronously with TrueCopy to a remote site. The TrueCopy (TC) replicas are mapped to the DR servers.
- Three point-in-time ShadowImage copies are taken at the remote site.

I believe every IT pro would classify this as an advanced and modern DR solution. And it is. Nevertheless, even in environments with heightened awareness of high availability and disaster recovery, gaps and configuration drifts are unavoidable. In this case, the local point-in-time ShadowImages taken for backup were not consistent! Data could not be restored from backup.

How did it happen?

As I mentioned before, this is a 24/7 environment. For this reason, database cold backup is out of the question. So they use hot backup such that the database is still online and accessible. To assure image data consistency, a very delicate process must be implemented. Any deviation from this process may result in image consistency issues…which would render the backup unusable.

In essence, the process is:
1. Synchronize the replicas.
2. Verify that full synchronization was achieved.
3. Enter hot backup mode.
4. Split the replicas of data files.
5. Verify that split was completed successfully.
6. End hot backup.
7. Switch logs.
8. Split the replicas of log files… verify that split was completed successfully.

Additional steps may include creating copies of control files, enabling storage consistency solutions such as EMC / HDS Consistency Groups, etc.

Note that this process involves different silos, platforms and IT teams.

So how did it happen?

The timing of events was not fully synced.

The database entered hot backup mode “only” 2 minutes after the replica split had already been initiated. In this specific case, the issue was caused by the use of different schedulers by different teams (Control-M, crontab). Of course, there could be various other reasons (time not in sync, daylight saving time configuration differences, misunderstanding between IT teams, a change performed by one team which another team was unaware of…).
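
To make the timing problem concrete, here is a minimal Python sketch that checks recorded start times against the order the process requires. The step names and timestamps are made up for illustration (with the verification steps folded into their parent steps); in practice the times would have to be pulled from the scheduler and storage logs of the different teams.

```python
from datetime import datetime

# Required order, following the hot backup process described earlier.
EXPECTED_ORDER = [
    "enter_hot_backup",
    "split_data_replicas",
    "end_hot_backup",
    "switch_logs",
    "split_log_replicas",
]


def ordering_violations(events):
    """Return (earlier_step, later_step) pairs whose recorded times are out of order.

    events maps each step name to the datetime at which that step started.
    """
    violations = []
    for earlier, later in zip(EXPECTED_ORDER, EXPECTED_ORDER[1:]):
        if events[earlier] >= events[later]:
            violations.append((earlier, later))
    return violations


# Hypothetical timestamps resembling the incident: the split starts at 02:00,
# but the database enters hot backup mode two minutes later.
events = {
    "enter_hot_backup": datetime(2009, 6, 1, 2, 2),
    "split_data_replicas": datetime(2009, 6, 1, 2, 0),
    "end_hot_backup": datetime(2009, 6, 1, 2, 10),
    "switch_logs": datetime(2009, 6, 1, 2, 11),
    "split_log_replicas": datetime(2009, 6, 1, 2, 12),
}
print(ordering_violations(events))
```

Note that a check like this only works if the clocks of all the systems involved are themselves in sync, which is one of the possible root causes listed above.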

Also, the customer uses Oracle ASM. This means greater risks of data inconsistency in hot backup scenario, since even without any client altering data, automatic rebalancing can be performed by Oracle while generating the point-in-time copies. (Rebalancing impacts replica data consistency similarly to performing database writes).

The first scan by RecoverGuard exposed this vulnerability. The various IT teams were completely unaware of this situation and were amazed when it was discovered. The error was immediately rectified, along with other errors and improvement opportunities detected for Hitachi Dynamic Link Manager (HDLM, used for multi-pathing), Microsoft SQL and Veritas Cluster.

The Business Continuity/DR manager further explained that their quarterly DR tests only included verifying the TrueCopy replicas, and even then they perform a graceful shutdown of production, which simulates neither their routine backup/replication procedures nor a true disaster scenario.

Today’s data centers are just too complex to manage manually. Even with the best teams and processes, configuration drifts and gaps are unavoidable in constantly changing data centers.


Webinar: Why Your DR/HA Systems Will Fail….

May 20, 2009

by Doron Pinhas
VP, Field Operations

Last Thursday I had a great webinar discussion with analyst Christine Taylor from The Taneja Group on one of the greatest threats to recoverability and HA: configuration drift. The event was called “Why Your HA/DR Systems Will Fail…and How to Make Sure They Won’t”, and if you couldn’t join us live, the webinar is now available on-demand. Just go to our website (http://www.continuitysoftware.com) and click on the link under Latest Webinars.

When configuration drift occurs – and it is inevitable – your production or primary infrastructure configurations become different from your recovery or secondary infrastructure. This creates serious data protection and host configuration gaps that threaten your ability to achieve your Recovery Point and Recovery Time Objectives.

Christine and I covered a lot of topics during our conversation, including:

  • Why configuration drift is a process problem, not a technology problem
  • Why disaster recovery and availability testing falls short of addressing the issue
  • How automated testing and monitoring solutions from companies like Symantec and Continuity Software are helping companies bullet-proof their DR/HA strategies

In addition, I provided a detailed look at several common recoverability/availability gaps that are created by configuration drift, why they occur, how they will impact operations, and how you can avoid them.

I hope you get a chance to tune in. I think you’ll find it worthwhile.