Hidden recovery risk that will come back to bite you

January 9, 2012

Today’s Hidden Risk: Invalid Database Snapshot

An invalid database copy can be created when a storage-based snapshot (or any other type of copy, such as replica, clone, etc.) of the database is taken while the database is not in “hot backup” mode. In this case, the copy may be unusable when a restore is attempted.
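As a rough illustration of the check involved, here is a minimal Python sketch. It assumes you can obtain the time each snapshot was taken and the periods during which the database was in hot backup mode; the `BackupWindow` record and the `snapshot_is_covered` function are hypothetical names made up for this example, not part of any product or database API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class BackupWindow:
    """Hypothetical record: a period when the database was in hot backup mode."""
    begin: datetime
    end: Optional[datetime]  # None means backup mode is still active


def snapshot_is_covered(snapshot_time: datetime, windows: List[BackupWindow]) -> bool:
    """Return True if the snapshot was taken while some backup window was open."""
    for w in windows:
        if w.begin <= snapshot_time and (w.end is None or snapshot_time <= w.end):
            return True
    return False


# Illustrative data: backup mode was active from 02:00 to 02:05.
windows = [BackupWindow(datetime(2012, 1, 9, 2, 0), datetime(2012, 1, 9, 2, 5))]
inside = snapshot_is_covered(datetime(2012, 1, 9, 2, 3), windows)
outside = snapshot_is_covered(datetime(2012, 1, 9, 3, 0), windows)
print(inside, outside)
```

A snapshot that falls outside every window is the kind of copy that may turn out to be unusable at restore time.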

 

Click here to learn:

  • How can an invalid database snapshot occur in your environment?
  • Why will it come back to bite you?
  • What will be the impact on your environment?
  • How can you avoid the risk?

This risk is one of 5,000+ downtime and data loss vulnerabilities identified in over one hundred of the largest data centers worldwide included in Continuity Software’s community-sourced knowledgebase.
Read more now for free tips that will help you avoid this risk


The Vulnerability Index Benchmark

December 26, 2011

Based on data gathered from 88 organizations worldwide, the Vulnerability Index Benchmark by Continuity Software provides a first of its kind measurement of downtime and data loss risk for each organization, grouped by industry sector.

Get your complimentary copy of the 2011 Vulnerability Index Benchmark study and find out:

  • Which areas of the IT infrastructure present the greatest HA/DR risks?
  • Which industry sectors exhibit the highest levels of downtime and data loss risks?
  • What types of HA/DR risks are most common at each sector?
  • How can you compare your risk to your industry peers?

Link to download the 2011 Vulnerability Index Benchmark study


Recent outage events catch BCP with their pants down

August 8, 2011

There has been a flurry of recent reports regarding outages leading to downtime and data loss. Here is just a small sample:

Clearly all the above invest heavily in HA/DR solutions, so why were they caught unprepared when an outage occurred?

We’ve addressed this question in previous posts such as “Why DR testing doesn’t work”, “Think that little config change is minor? Think again” and “BCP is not different than other IT departments”. The short answer is that today’s datacenters are too complex and dynamic to rely on periodic tests or audits.

Conclusion:  real-time verification of HA/DR readiness is required.

How can this be accomplished? Considering the amount of labor put into a single audit/test, automation is a must. Disaster Recovery Management solutions such as RecoverGuard and DR Advisor can help you achieve the following benefits:

  • Continuous non-intrusive end-to-end infrastructure scan for 24×7 visibility into DR risks
  • HA/DR risk detection and readiness of passive systems for failover
  • Ability to measure the de-facto RPO and other HA/DR SLA metrics
  • Comprehensive HA/DR readiness reporting

To learn how the best datacenters in the world accomplish this feat, check out our latest webinar.


Quick Guidebook: Top 10 Private Cloud Risks

March 10, 2011

Enterprises routinely build Disaster Recovery and High Availability measures into their private cloud infrastructure, so why do downtime and data loss risks still exist?

The reality is that even the most robust Disaster Recovery and High Availability (DR/HA) plans are only as good as your ability to test them.

To help you out, we have assembled a community-driven database of over 4,000 issues that pose downtime and data loss risks. While we would love to share them all, you can start with a peek at ten top risks in the private cloud environment.

http://www.continuitysoftware.com/downloads/Top10PrivateCloudRisks.pdf

Yahav Adorian
Continuity Software


Cutting HA/DR costs with analytics

July 12, 2010

Availability, Recoverability and Data Protection are critical to any enterprise. The alternative cost is unacceptable in capital and reputation losses. Thus, significant time, resources and money are allocated to ensure that all business lines are highly available and can be recovered under various circumstances (from accidental file deletion to an earthquake). However, the datacenter is constantly changing, and despite the huge effort and world-class IT experts, new risks emerge on a regular basis. Traditional mitigation approaches do not solve the problem, and they regularly cost organizations in downtime, potential data loss and excessive operations, since:
• Discrete data is gathered by discrete systems, but none correlates all required layers: storage, OS, Database, replication, clustering, etc.
• Home-grown data collection and correlation is economically irrational

Using an HA/DR analytics solution, organizations can dramatically increase availability and recoverability levels on the one hand and, on the other hand, save significant time and money. An HA/DR analytics solution such as RecoverGuard by Continuity Software or DRA by Symantec analyzes thousands of potential risks by correlating the configuration of applications, databases, file systems, servers, storage, replication, clustering and “what’s between”. It is updated regularly, and many like to think of it as an “anti-virus for HA/DR”.

So how can an HA/DR analytics solution help cut down HA/DR costs? It’s simple really. Most enterprises are overspending in the following three direct-cost areas:
• Cost of avoidable downtime
• DR testing operations expense
• HA/DR related sub-optimal resource utilization

Let’s explore each of these cost areas.
Cost of avoidable downtime. Unsuccessful cluster failover, single points of failure in the storage network/multipath, RAID level issues, risky layout of database files, suboptimal configuration of database vs. file systems vs. storage leading to unacceptable performance… all these and much more can be completely avoided by deploying an HA/DR analytics solution. If an hour of downtime costs 100K, a very serious cost reduction opportunity lies here.
DR testing operations expense. The organization can become aware of DR readiness before actually performing the test and failing over, and thus execute a DR drill only after resolving known recoverability issues. By doing so, significant time and resources are spared. Furthermore, by identifying new threats on the spot, as they emerge, resolution time and the manpower involved are kept to a minimum. Last night’s changes are still fresh and identifying the root cause is easy – unlike when an error occurs in a yearly DR drill and no one remembers the specific change (one of many…) that was performed months before and created the error. Moreover, with the reporting features and in-depth visibility into dependencies between production and DR systems, replication, cluster configuration (and so on), a DR test requires fewer resources from the various IT teams and much less manual labor from the BCP personnel.
HA/DR related sub-optimal resource utilization. While it is not the main purpose of HA/DR analytics, the data gathered by such tools allows them to identify saving opportunities. Examples of such opportunities around storage are allocated but unused devices, old replicas, file systems or raw devices allocated to a database but hardly used, and so on. On the storage network side, replication bandwidth can be optimized by detecting excessive replication, swap replication, temp database replication and so on. Naturally, hidden saving opportunities exist in other layers as well.

HA/DR readiness verification is too complex a task to be performed manually without the right tools for the job (check out my “BCP is not different than other IT departments” post). Considering the different teams involved, the different layers, products and vendors, and the endless details embedded within each unique component, doing it manually is practically mission impossible. You know when a DR test starts but you don’t really know when it is going to end… and you don’t know whether data is recoverable at any given time or when the next downtime event will hit. Automation is the key to success and significant cost reduction. HA/DR analytics solutions can dramatically increase control over HA/DR readiness and at the same time reduce the costs considerably.


BCP is not different than other IT departments

May 20, 2010

Among other duties, BCP personnel bear the responsibility for recoverability in case of disaster. The BCP manager must verify that at any given time, data can be recovered and operations can be resumed successfully according to the policies (RPO, RTO) set by the organization. This is all fine and dandy in theory, but take a moment to think about it – How exactly will a BCP manager determine recoverability status at any given time?

At best, a DR test is performed every quarter. Suppose you were responsible for the IT recoverability of a large financial institution and that DR exercises are performed on January 1st, April 1st and so on. What would you tell the CIO should he ask you on February 15 if IT is recoverable? Would you be able to answer confidently “Yes”? The honest and correct answer would be “Sir, I do not know. We fixed the glitches found in the Jan 1 DR test so I guess IT was recoverable back then… But now, I cannot say for sure. Probably not.” It gets worse, right? Sure it does. Further deliberation will expose other weak spots in DR testing that we all experience: the DR test included only a small portion of IT… not all critical systems… production wasn’t really shut down or cut off during the test… it didn’t really simulate end users (or load scenarios)… I can go on and on. And so, the question remains – How will BCP evaluate readiness for DR at all times?

Let’s compare notes with other datacenter departments. How does Network Security know that the network is secured? How does a system administrator know that a server is malfunctioning? The answer is simple: They have visibility into their domain. In other words, they have the tools that allow them to explore their area of responsibility, get an up-to-date detailed status and receive automatic notification when something goes wrong. BCP, like any other IT department, must have the right tools for the job. Yet unlike system administrators, database admins and the like, BCP needs a management solution that provides visibility into all IT layers, not just into servers or just into database configurations. Furthermore, DRM solutions must be capable of analyzing the dependencies between the different layers and finding recovery vulnerabilities.

Can you imagine any organization with a 7+ figure IT budget not purchasing a server performance solution such as HP Performance Manager or IBM Tivoli Monitoring? Or network monitoring and event management solutions such as CA Spectrum/eHealth or HP NNM? Of course not, because it’s clear that datacenter monitoring requires automation (it is too complex a task to be performed continuously and accurately by human beings) and that without automated monitoring, suboptimal operation and downtime are unavoidable. BCP/Recovery management is no different. Without a DRM solution, the BCP personnel are “blind” and unaware of datacenter status in terms of readiness for recovery. They must put their trust and faith in the hands of IT teams whose first priority is production. Those teams might be kind enough to share some technical details with the BCP team… but a working datacenter is not based on mere kindness and “favors” but on intelligent processes which lead to efficient, goal-oriented teamwork.

The good news is that high-end DRM solutions have emerged in the last few years, giving BCP personnel just the tools they were missing. Products such as RecoverGuard by Continuity Software and Disaster Recovery Advisor by Symantec provide BCP staff with a real-time business-oriented status of readiness for disaster (including both HA and DR). These analytics tools automatically identify hidden HA/DR risks and let the user know about them as soon as they happen. They also let the users explore the different IT layers, understand dependencies between production databases, servers, storage, remote storage, DR server and so on. If you are thinking about deploying a DRM solution, note that Continuity Software offers a risk free 48-Hour RecoverGuard pilot.

To guarantee successful recovery 365 days a year, BCP/recovery personnel must have solutions that provide visibility all across the datacenter, and that automatically and continuously perform datacenter configuration analysis to ensure no recovery gaps and vulnerabilities exist. Such DRM solutions have grown in the past few years to be an integral part of every large IT organization, as it became apparent that with such solutions, significant downtime and loss of data can be avoided.

I, for one, believe it is a major milestone in the everlasting struggle to control HA/DR.


The importance of creating point-in-time copies on a regular basis

April 5, 2010

Today I’d like to take the time to outline the differences between continuous replication (often referred to as “synchronous” or “asynchronous”) and periodic replication, or point-in-time (PiT) copies.

When referring to “continuous replication”, I include any type of replication which maintains a target copy synchronized with its source on an on-going basis. This category may include:

  • Synchronous replication – Target is identical to the source at any time (no write is performed on the source without first being performed on the target). Examples of synchronous replication are EMC SRDF/S, HDS TrueCopy (TC) and HP Continuous Access Synchronous.
  • Asynchronous replication – Similar to synchronous replication, target is continuously being synchronized with the source; however it may have a lag of several minutes or less. Examples of asynchronous replication would be EMC SRDF/A, HUR (Hitachi Universal Replicator) and IBM Global Mirror.

Point-in-time (PiT) copies, on the other hand, are not continuously synchronized. These copies are updated from their source at specific times, and once they reach full synchronization, the process is stopped. Note that this definition covers both copies kept within the same storage frame (such as EMC TimeFinder, HDS ShadowImage, NetApp snapshots, IBM FlashCopy) and on a different frame (such as NetApp SnapMirror, timed synch-split SRDF, HUR copies, etc.)

 Why do we need a/synchronous replication? Why do we need PiT copies? Do we need both or can we choose only one of the two types? What are the benefits and weak spots of each method?

When it comes to disaster recovery, the different scenarios can be divided into two groups:

  • Physical risks. This group includes hardware failure, outage, natural disaster and so on.
  • Logical risks. This group includes accidental data deletion or corruption by users or applications, software error that harms data integrity, viruses, etc.

 The short answer is:

  • You’d need both continuous replication and PiT copies to ensure successful recovery from both the physical and logical risk scenarios.
  • A/Synchronous replication is mainly aimed at dealing with the “physical risks” group.
  • PiT copies solve the threats associated with the “logical risks” group.

In more detail: when an outage or any kind of physical error occurs, continuous replication will allow you to recover with the least amount of data loss (or with no data loss at all). However, when (for example) a file is deleted accidentally, it gets simultaneously removed from your synchronized copy as well! Thus, recovery with continuous replication is doomed to fail. In order to recover, you’ll need a saved copy of the file from a time before its deletion. In other words, a point-in-time copy. The file can be copied from the PiT copy to the fully synchronized target, thus resulting in an up-to-date and valid copy of the source, and operations can continue. Moreover, when an unknown set of files has been compromised, the organization may choose to recover directly from the PiT copy.
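To make the difference concrete, here is a toy Python simulation of the accidental-deletion scenario. It is only a sketch: plain dictionaries stand in for the source volume, the continuously synchronized target and the PiT copy, and the file names and the `delete_file` helper are invented for illustration.

```python
import copy

# Toy model: each "volume" is a dictionary of file name -> content.
source = {"orders.db": "v3", "report.doc": "final"}
original = dict(source)            # kept only to verify the result at the end
replica = dict(source)             # continuously synchronized target
pit_copy = copy.deepcopy(source)   # point-in-time copy taken before the mistake


def delete_file(name: str) -> None:
    """Delete a file on the source; continuous replication repeats it on the target."""
    del source[name]
    replica.pop(name, None)


delete_file("report.doc")  # the accidental deletion

# The synchronized target no longer has the file, but the PiT copy does.
recoverable_from_replica = "report.doc" in replica
recoverable_from_pit = "report.doc" in pit_copy

# Copy the file from the PiT copy back to the synchronized target.
replica["report.doc"] = pit_copy["report.doc"]
print(recoverable_from_replica, recoverable_from_pit, replica == original)
```

The same reasoning applies to corruption: whatever damage reaches the source is faithfully replicated, so only a copy frozen before the event can bring the data back.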

Now I know what you’re thinking – “Hey! I can do the same thing with my tape backups. I need no PiT copies”. That is true, but only to some extent. There are many advantages to having PiT copies on top of backup, such as:

  • PiT copies are significantly more available – can be used immediately. Moreover, you can have PiT copies in every site.
  • You can create PiT copies frequently – as often as needed (every couple of hours)
  • Retrieving files from backup is painfully slow (hours to days) – hardly enough to meet the (enterprise level) recovery time objectives (RTO)
  • In addition, there are obvious benefits to having PiT copies and taking backup of them instead of directly backing up production…but that’s a whole different topic.

Any organization with strict RPO and RTO policies would be wise to maintain both continuous and PiT copies. Any other architecture has weak spots that may end up in loss of data or prolonged downtime in case of failover.


About virtualization and disaster recovery

March 29, 2010

Virtualization and solutions such as VMware SRM certainly offer great advantages for DR testing. The assurance that the production and DR servers are 100% identical is a most appealing feature of virtualization. Other benefits, such as the ability to run production and DR in parallel, run a DR exercise whenever you want and the simplicity of virtualized servers, also bring progress to the field of Disaster Recovery Testing. However, even with virtualization, successful recovery is far from being a slam dunk. In the next paragraphs I’ll try to outline a few of the challenges around recovery in a virtualized environment.

Configuration errors and best practice violations may still render your replicated virtual machines/data corrupt and/or inconsistent, and thus unrecoverable. Very much like in the physical (pre-virtualization) world, if you do not follow the rules and diligently make sure that implementation and day-to-day changes meet the guidelines of the different vendors, your recovery will be at risk. For example, a point-in-time copy (created with EMC TimeFinder, HP StorageWorks XP Business Copy or the like) taken while the source virtual machine was not shut down or suspended is at high risk of being inconsistent. Some of you may recall similar concepts for creating consistent images for databases such as Oracle (cold/hot backup), UDB (I/O suspension) and other DBMSs. Of course, this is just one example from a long list of prerequisites, guidelines and recommendations – and each vendor has its own list. On top of that – there are specific cross-vendor guidelines e.g. NetApp and VMware, Hitachi and Hyper-V, etc. – but we’ll get to that later.

A DR exercise is still a complex operation. Yes, with tools such as VMware SRM, in theory a DR test is just a few clicks away. In reality, there remain many challenges that prevent frequent DR tests, such as:

Complete Prod/DR separation is difficult and mistakes gravely affect production

  • Network collisions, conflicts, etc.
  • Dependencies on physical elements (a file server or other sensitive non-virtualized application that at the same time interacts with virtualized components)

DR testing is more than just failing over – it’s a complex operation

  • Different teams must verify storage, servers, databases and applications are functioning properly
  • Real life scenario must be simulated including peak load scenarios
  • Workstations must be manned with end users
  • It takes time
    • Problems must be resolved
    • Processes must be coordinated
  • It’s not enough to bring everything online; reasonable performance must also be tested and assured (see example ahead)

Manpower – as implied by the previous bullet – a DR exercise requires dedicated cross-domain human resources

  • BCP personnel, Project managers, IT managers,…
  • Storage administrators
  • Unix and Windows system administrators
  • Network administrators
  • Security personnel
  • Oracle DBAs, MS-SQL DBAs, …
  • Application owners – WebSphere, Bea, Exchange, Lotus,…
  • End users

 

Dependencies and overlap areas between different domains and areas of responsibility create vulnerabilities which jeopardize the ability to recover successfully. Virtual machines depend on correct storage and replication configuration. For example, a reduced RAID level configuration puts your VM at risk. Furthermore, if VMFS is only partially replicated to the remote site, or if consistency groups do not include all required resources, data will be lost upon disaster. There are plenty of other examples of what-could-go-wrong in the VMware-Storage overlap (is your remote ESX configured with the same multipath level as the production ESX? Same load balance algorithm? Queue depth? I can go on and on). Other dependencies exist between databases and VMware – depending on your required level of recovery assurance, you may need to put the database in backup mode (Oracle lingo) while creating VMware or storage snapshots (unless you’re willing to settle for recovery-not-guaranteed crash-consistent copies…). Another overlap area is between virtualized and non-virtualized environments. In the real world, not all assets are virtualized. Those assets interact with virtualized components. Hence, all the complexities of the physical disaster recovery drill still exist (some may say that having to deal with two types of environments creates an even greater challenge). Examples of such virtual-to-non-virtual relationships are:

  • Virtualized client accessing non-virtualized NFS/CIFS file server
  • Database on VM interacts (via DB links for example) with database on a physical server
  • Virtualized business line relies on data from other non-virtualized business line – or vice versa
  • Virtualized clients accessing non-virtualized applications (Exchange, Lotus, etc.)
  • Physical domain / DNS servers serving virtualized environments
  • And so on…
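The production-versus-DR host questions raised above (same multipath level? same load balance algorithm? same queue depth?) boil down to comparing two sets of settings. Here is a minimal Python sketch of such a comparison. The setting names and values are made up for illustration and are not taken from any VMware or storage vendor interface; in practice they would have to be collected from the hosts themselves.

```python
from typing import Any, Dict, Tuple


def config_drift(prod: Dict[str, Any], dr: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return every setting whose value differs, as name -> (production, DR)."""
    keys = sorted(set(prod) | set(dr))
    return {k: (prod.get(k), dr.get(k)) for k in keys if prod.get(k) != dr.get(k)}


# Hypothetical settings for a production host and its DR counterpart.
prod_host = {"multipath_paths": 4, "load_balance": "round_robin", "queue_depth": 64}
dr_host = {"multipath_paths": 2, "load_balance": "round_robin", "queue_depth": 32}

drift = config_drift(prod_host, dr_host)
for setting, (prod_value, dr_value) in drift.items():
    print(f"{setting}: production={prod_value}, DR={dr_value}")
```

A setting that exists on only one side shows up with `None` on the other, which is usually worth a look as well.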

 

To sum up – Yes, virtualization is a game changer for Disaster Recovery Management (DRM). Nevertheless, many of the traditional BCP/DR challenges still exist, as well as several new challenges which have emerged as a result of using virtualization. Running a DR exercise is simple in theory but not in practice. To ensure successful recovery, an enterprise organization must invest significant time, money and human resources. Automation is the key. The use of HA/DR monitoring solutions such as Continuity Software’s RecoverGuard and Symantec’s Disaster Recovery Advisor (DRA) can give BCP and IT teams visibility into dependencies in virtualized and physical environments and automatic availability/recovery vulnerability detection.


Why Every Disaster Recovery Plan Must Include RPO

December 21, 2009

In our last post, we provided an overview of recovery point objective (RPO), a critical disaster recovery metric that defines the level of data loss a company is willing to tolerate when an outage takes place.  In this entry, we’ll discuss what makes RPO so crucial when it comes to facilitating proactive disaster recovery planning strategies.

Today’s companies run on information.  Corporate data is leveraged each and every day during automated business transactions and in support of strategic planning and decision-making by executives and managers.  And in many cases, data is made available to customers to enhance service and satisfaction, or to external business and supply chain partners.  So when information is lost as the result of a technical failure or system outage, mission-critical operations can be severely affected.

The level of impact an organization will feel when a disaster strikes will depend greatly on the type of information lost and its primary use.  Data can be classified in a variety of ways – there is data needed for revenue generation, data that enhances the customer experience, data that facilitates internal productivity, etc.  Those who are defining recovery plans need to take these data classes into account, and set an appropriate RPO for each. For example, data that is required for sales and revenue generation, such as inventory information that lets customers know if a product is in stock before they place an order over the Web, would need a shorter, more rigid RPO than data utilized in non-critical internal processes.
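As a small illustration of setting an RPO per data class and checking against it, consider the Python sketch below. The class names, RPO values and measured lags are invented for the example; real numbers would come from your own policies and from whatever tool measures your replication lag.

```python
from datetime import timedelta
from typing import Dict, List

# Hypothetical RPO policy: the maximum tolerated data loss per data class.
rpo_policy: Dict[str, timedelta] = {
    "revenue_generation": timedelta(minutes=5),
    "customer_experience": timedelta(hours=1),
    "internal_productivity": timedelta(hours=24),
}

# Hypothetical measurements: the age of the newest recoverable copy per class.
measured_lag: Dict[str, timedelta] = {
    "revenue_generation": timedelta(minutes=12),
    "customer_experience": timedelta(minutes=20),
    "internal_productivity": timedelta(hours=6),
}


def rpo_violations(policy: Dict[str, timedelta], lag: Dict[str, timedelta]) -> List[str]:
    """Return the data classes whose measured lag exceeds their RPO.

    A class with no measurement at all is treated as a violation.
    """
    return sorted(name for name, rpo in policy.items()
                  if lag.get(name, timedelta.max) > rpo)


violations = rpo_violations(rpo_policy, measured_lag)
print(violations)
```

In this made-up data only the revenue-related class misses its target, which is exactly the class where a miss hurts the most.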

By taking this approach, companies can better design, budget for, and implement the optimal IT solution to ensure that all RPOs are met.  Additionally, they can more effectively communicate disaster recovery plans to all internal and external stakeholders, to maximize preparedness if and when an outage does occur.

Those companies who don’t set RPOs for the various types of data they maintain – and the systems that house them – might find that their disaster recovery plans fail in the event of an emergency.  Without setting and measuring RPOs, it can be quite difficult to properly define disaster recovery processes, or to clearly articulate how those processes should be carried out.  As a result, organizations may:

  • Face unnecessary risks due to “gaps” in their plan’s coverage.  This can lead to unacceptable data loss, which ultimately translates to lost revenue, lack of regulatory compliance, customer churn, or damage to brand image and reputation.
  • Hinder the efficiency and effectiveness of disaster recovery procedures.  Without clear instructions, IT teams are likely to either over-provision (wasting valuable human and financial resources) or under-provision (leaving important data unprotected from loss) their environments. In our many years of experience helping companies evaluate the technical validity of their disaster recovery infrastructures, this is one of the issues we see most frequently.

And, perhaps most importantly, remember that defining an RPO is not a one-time event.  It is an ongoing process that must be flexible, leaving room for re-evaluation and refinement as business needs, technology environments, and other internal and external factors change.

Visit our Web site to find out more about RPO, its vital role in the disaster recovery process, and RecoverGuard, our robust solution for enabling rapid, accurate RPO measurement.


Using LVM mirroring for disaster recovery?

December 20, 2009

One of the disaster recovery solutions in use by many organizations is LVM mirroring and snapshots. Some of you may raise an eyebrow reading this, thinking “LVM mirroring? Over tens of kilometers?”. The answer is “Yes”. While traditionally LVM mirroring was used locally in order to keep data highly available and to create copies of logical volumes, today more and more organizations choose LVM mirroring as the solution to keep a synchronized copy of the local data at the remote site. So in fact, instead of using storage-based replication technologies, data is copied at the host level. Of course, to ensure high resiliency, reliability, availability and performance, data is still stored on SAN arrays such as EMC DMX, HP XP, IBM DS and so on. I am guessing this approach is becoming more and more popular since LVM mirroring is typically free or part of basic already-purchased software packages, while remote replication software usually requires a separate, sometimes costly, license.

Unfortunately, LVM mirroring for disaster recovery is no less complex than storage replication. Many of the risks associated with storage replication are still relevant in the LVM mirroring scenario. Moreover, LVM mirroring introduces a few new risks that do not exist in storage replication.

Here are a few examples of configuration errors that are often created over time and lead to data loss and extended recovery time in case of disaster recovery:

#1: Incorrect mirror configuration

The file system is striped and the source data is stored on several SAN volumes on the local SAN array. The DR mirror is stored on several SAN volumes; however, one of these volumes is from a local SAN array. Oh dear, oh dear… in case of disaster, this would result in complete data loss. Data would have to be restored from backup, resulting in RPO and RTO violations. This is a very common risk signature that is detected by RecoverGuard on datacenters implementing LVM mirroring and snapshots.
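The idea behind checking for this condition can be sketched in a few lines of Python. This is only an illustration of the logic, not how RecoverGuard works: the `SanVolume` record, the array names and the volume names are all hypothetical, and collecting the real volume-to-array mapping is the hard part in practice.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SanVolume:
    """Hypothetical record: a SAN volume and the array it physically lives on."""
    name: str
    array: str


def misplaced_mirror_legs(mirror_legs: List[SanVolume], remote_array: str) -> List[str]:
    """Return the DR mirror volumes that are NOT located on the remote array."""
    return [v.name for v in mirror_legs if v.array != remote_array]


# Illustrative DR mirror: two volumes are remote, one was allocated locally by mistake.
dr_mirror = [
    SanVolume("lun_21", "remote_array"),
    SanVolume("lun_22", "remote_array"),
    SanVolume("lun_07", "local_array"),
]

bad_legs = misplaced_mirror_legs(dr_mirror, "remote_array")
print(bad_legs)
```

Because the file system is striped, a single misplaced volume is enough to make the whole DR mirror useless once the local site is gone.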

#2: Missing mirrors

When using synchronous storage replication, replication pairs between the local and the remote array are often pre-configured. Hence, when a new storage volume is allocated and used on the production site, it is already being replicated and protected. With LVM mirroring, this is not the case. The administrator must keep track of any new logical volume and create the mirror every time. Some changes slip through and the result is un-mirrored, unrecoverable production data. In addition, sometimes the administrator knowingly doesn’t create a mirror for a logical volume because it is currently unused. However, at a certain point the logical volume is put to use, and the administrator is either not aware of that or was not notified of the change (which reminds me of the post I’ve made regarding IT team coordination… check it out). The outcome is yet again complete data loss upon disaster.
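Keeping track of this by hand is error-prone, but the underlying check is simple once you have an inventory. The Python sketch below assumes a hypothetical inventory that maps each logical volume name to its number of copies (1 meaning no mirror); how you build that inventory depends on your LVM software and is outside the scope of this example.

```python
from typing import Dict, List


def unmirrored_volumes(copies_per_volume: Dict[str, int]) -> List[str]:
    """Return the logical volumes that have fewer than two copies (no mirror)."""
    return sorted(lv for lv, copies in copies_per_volume.items() if copies < 2)


# Hypothetical inventory: logical volume name -> number of copies.
inventory = {"lv_data01": 2, "lv_data02": 2, "lv_logs": 1, "lv_spare": 1}

missing = unmirrored_volumes(inventory)
print(missing)
```

Note that the currently unused `lv_spare` is reported as well, so it gets noticed before someone quietly puts it to use.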

#3: Not built for a large-scale datacenter

One of the greatest obstacles of LVM mirroring and snapshots is that it was never designed to be used on a large scale. As a result, there are no management tools that will allow you to enforce policies, act on groups and so on. Several sample weak spots that can be included in this category are:
  • No federated consistency. With storage replication, one may create a disk group that will include many servers and will ensure I/O consistency (but not necessarily application level consistency). With LVM mirroring, this is not an option. Consistency is only guaranteed within a server.
  • Difficult to manage PiT copies. Establishing point-in-time copies (snapshots) requires heavy scripting, which in turn requires significant maintenance and care. Moreover, a large number of snapshots may have a grave impact on server performance.

#4: No true async mode

Most LVM software does not offer an asynchronous mirroring solution. Products that do usually rely on opportunistic/dirty mirroring, not committing to any SLA. Moreover, some of these solutions require significantly more storage (cascading async mirroring on top of short-distance sync mirroring e.g. “bunker”). In today’s world, a good and reliable asynchronous data copy/replication tool is important. Due to performance considerations, sometimes synchronous replication is not an option. Moreover, by nature synchronous mirroring/replication is only applicable over short distances. LVM mirroring, however, may be even more sensitive to longer distances, as the source server needs access to the remote SAN array.

Other honorable mentions include site tagging management, complexities in root volume mirroring (boot from SAN) and the ability to mirror incompatible storage tiers (while some people consider this as an advantage, it may lead to performance loss).

Conclusions?

LVM mirroring is not a bad choice for a small datacenter or a specific business service with no strict performance requirements. However, take particular care to monitor and maintain the mirroring and snapshot configuration. Since every logical volume is managed separately, configuration drift is likely to occur, which will lead to loss of data and extended recovery time in the event of a disaster. Change management is important, even in medium-size environments and surely in the larger datacenters. Monitoring and risk analysis tools such as RecoverGuard can help you detect configuration errors, such as those mentioned above, as they occur. If you find this interesting, visit our website for additional information.

One of the disaster recovery solutions in use by many organizations is LVM mirroring and snapshots. Some of you may raise an eyebrow reading this, thinking “LVM mirroring? Over tens of Kilometers?”. The answer is “Yes”. While traditionally LVM mirroring was used locally in order to keep data highly available and to create copies of logical volumes, today more and more organizations choose LVM mirroring as the solution to keep a synchronized copy of the local data at the remote site. So in fact, instead of using storage-based replication technologies, data is copied at the host level. Of course, to ensure high resiliency, reliability, availability and performance, data is still stored on SAN arrays such as EMC DMX, HP XP, IBM DS and so on. I am guessing this approach becomes more and more popular since LVM mirroring is typically free or part of basic already-purchased software packages, while remote replication software usually requires a separated, sometimes costly, license.

Unfortunately, LVM mirroring for disaster recovery is no less complex than storage replication. Many of the risks associated with storage replication still apply in the LVM mirroring scenario. Moreover, LVM mirroring introduces a few new risks that do not exist in storage replication.

Here are a few examples of configuration errors that often creep in over time and lead to data loss and extended recovery time when a disaster occurs:

<strong>#1: Incorrect mirror configuration</strong>

The file system is striped, and the source data is stored on several SAN volumes on the local SAN array. The DR mirror is also stored on several SAN volumes; however, one of these volumes is from a <u><strong>local</strong></u> SAN array. Oh dear, oh dear… Because the data is striped, losing the local site leaves the remote copy incomplete, so in case of disaster this would result in complete data loss. Data would have to be restored from backup, resulting in RPO and RTO violations. This is a very common risk signature that is detected by <a href="http://www.continuitysoftware.com/products/RecoverGuard">RecoverGuard</a> in datacenters implementing LVM mirroring and snapshots.
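The check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `PhysicalVolume` record, the `site` labels and the sample inventory are all hypothetical, and in practice the data would have to be collected from your volume manager and array tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicalVolume:
    name: str   # hypothetical physical volume name
    array: str  # hypothetical SAN array identifier
    site: str   # "local" or "remote" (assumed labels)


def find_misplaced_mirror_legs(dr_mirrors, dr_site="remote"):
    """Return {logical_volume: [PV names of the DR mirror not located at dr_site]}."""
    problems = {}
    for lv_name, pvs in dr_mirrors.items():
        misplaced = [pv.name for pv in pvs if pv.site != dr_site]
        if misplaced:
            problems[lv_name] = misplaced
    return problems


# Hypothetical inventory: the DR mirror of lv_data has one leg on the local array.
dr_mirrors = {
    "lv_data": [
        PhysicalVolume("pv10", "array-B", "remote"),
        PhysicalVolume("pv11", "array-B", "remote"),
        PhysicalVolume("pv12", "array-A", "local"),
    ],
    "lv_logs": [PhysicalVolume("pv20", "array-B", "remote")],
}

problems = find_misplaced_mirror_legs(dr_mirrors)
print(problems)  # {'lv_data': ['pv12']}
```

Any logical volume that shows up in the result has a DR copy that would not survive the loss of the local site.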

<strong>#2: Missing mirrors</strong>

With synchronous storage replication, replication pairs between the local and the remote array are often pre-configured. Hence, when a new storage volume is allocated and used on the production site, it is already being replicated and protected. With LVM mirroring, this is not the case. The administrator must keep track of every new logical volume and create the mirror each time. Some changes slip through, and the result is un-mirrored, unrecoverable production data. In addition, the administrator sometimes knowingly skips the mirror for a logical volume because it is currently unused. Later, the logical volume is put to use, and the administrator is either not aware of the change or was not notified of it (which reminds me of the post I’ve made regarding IT team coordination… <a href="http://it.toolbox.com/blogs/disaster-recovery/the-importance-of-it-team-coordination-in-the-real-world-dr-lessons-learned-34313">check it out</a>). The outcome is yet again complete data loss upon disaster.

<strong>#3: Not built for a large-scale datacenter</strong>
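A periodic audit for missing mirrors might look roughly like the sketch below. The inventory format (a `copies` count and an `in_use` flag per logical volume) is an assumption made for illustration; it is not the output of any particular LVM tool.

```python
def audit_mirrors(volumes):
    """Split un-mirrored logical volumes into two sorted lists.

    unprotected: in use, but with fewer than two copies (data at risk today)
    latent:      not in use and un-mirrored (at risk as soon as someone uses them)
    """
    unprotected, latent = [], []
    for name, info in volumes.items():
        if info["copies"] >= 2:
            continue  # already mirrored
        (unprotected if info["in_use"] else latent).append(name)
    return sorted(unprotected), sorted(latent)


# Hypothetical inventory of logical volumes on one server.
volumes = {
    "lv_data": {"copies": 2, "in_use": True},
    "lv_new": {"copies": 1, "in_use": True},     # created, never mirrored
    "lv_spare": {"copies": 1, "in_use": False},  # deliberately left un-mirrored
}

unprotected, latent = audit_mirrors(volumes)
print("unprotected:", unprotected)
print("latent:", latent)
```

Reporting the currently unused volumes separately is a design choice: they are exactly the ones that turn into data loss once somebody starts writing to them without telling the administrator.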

One of the greatest drawbacks of LVM mirroring and snapshots is that they were never designed to be used on a large scale. As a result, there are no management tools that allow you to enforce policies, act on groups and so on. A few sample weak spots in this category are:
<strong>No federated consistency</strong>. With storage replication, one may create a disk group that includes the volumes of many servers and ensures I/O consistency across them (but not necessarily application-level consistency). With LVM mirroring, this is not an option. Consistency is only guaranteed within a single server.
<strong>Difficult to manage PiT copies</strong>. Establishing point-in-time copies (snapshots) requires heavy scripting, which in turn requires significant maintenance and care. Moreover, a large number of snapshots may have a grave impact on server performance.
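To give a feel for the kind of scripting involved, here is a minimal sketch of one small piece, the retention decision. The policy (keep the newest three snapshots, none older than seven days) and the snapshot names are assumptions for illustration. The sketch only decides which snapshots to remove; actually creating and deleting them depends on your volume manager and is not shown.

```python
from datetime import datetime, timedelta


def snapshots_to_delete(snapshots, keep_last=3, max_age=timedelta(days=7), now=None):
    """snapshots maps a snapshot name to its creation time.

    Returns the names to delete, oldest first: anything beyond the newest
    `keep_last` snapshots, plus anything older than `max_age`.
    """
    now = now or datetime.now()
    newest_first = sorted(snapshots, key=snapshots.get, reverse=True)
    expired = [
        name
        for position, name in enumerate(newest_first)
        if position >= keep_last or now - snapshots[name] > max_age
    ]
    return sorted(expired, key=snapshots.get)


# Hypothetical snapshot list for a single logical volume.
snapshots = {
    "snap_0101": datetime(2012, 1, 1),
    "snap_0107": datetime(2012, 1, 7),
    "snap_0108": datetime(2012, 1, 8),
    "snap_0109": datetime(2012, 1, 9),
    "snap_0110": datetime(2012, 1, 10),
}

doomed = snapshots_to_delete(snapshots, now=datetime(2012, 1, 10, 12))
print(doomed)  # ['snap_0101', 'snap_0107']
```

Multiply this by every logical volume on every server, add error handling and scheduling, and the maintenance burden mentioned above becomes clear.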

<strong>#4: No true async mode</strong>

Most LVM products do not offer an asynchronous mirroring solution. Those that do usually rely on opportunistic/dirty mirroring and do not commit to any SLA. Moreover, some of these solutions require significantly more storage (cascading async mirroring on top of short-distance sync mirroring, e.g. “bunker”). In today’s world, a good and reliable asynchronous data copy/replication tool is important: performance considerations sometimes rule out synchronous replication, and by nature synchronous mirroring/replication is only applicable over short distances. LVM mirroring may be even more sensitive to distance, because the source server itself needs access to the remote SAN array.

Other honorable mentions include site tagging management, complexities in root volume mirroring (boot from SAN) and the ability to mirror across incompatible storage tiers (while some people consider this an advantage, it may lead to performance loss).
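A back-of-the-envelope calculation shows why distance hurts synchronous writes. The sketch below assumes the commonly cited figure of roughly 5 microseconds per kilometer for light in optical fiber (one way) and, by default, one round trip per write. Both are simplifying assumptions: real links add switch, protocol and array overhead, and some protocols need more than one round trip per write.

```python
FIBER_DELAY_US_PER_KM = 5.0  # assumption: ~5 microseconds per km, one way


def sync_write_penalty_ms(distance_km, round_trips_per_write=1):
    """Minimum latency (in ms) that distance alone adds to each synchronous write."""
    round_trip_us = 2 * distance_km * FIBER_DELAY_US_PER_KM
    return round_trips_per_write * round_trip_us / 1000.0


for km in (10, 100, 1000):
    print(f"{km:>5} km: at least {sync_write_penalty_ms(km):.1f} ms added per write")
```

Under these assumptions, 100 km adds at least a millisecond to every write before it can be acknowledged, which is why synchronous mirroring is normally limited to short distances.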

<strong>Conclusions? </strong>

LVM mirroring is not a bad choice for a small datacenter or a specific business service with no strict performance requirements. However, take particular care to monitor and maintain the mirroring and snapshot configuration. Since every logical volume is managed separately, configuration drift is likely to occur, which will lead to loss of data and extended recovery time in the event of a disaster. Change management is important, even in medium-sized environments and certainly in larger datacenters. Monitoring and risk analysis tools such as <a href="http://www.continuitysoftware.com/products/RecoverGuard">RecoverGuard</a> can help you detect configuration errors such as those mentioned above as they occur. If you find this interesting, visit our <a href="http://www.continuitysoftware.com">website</a> for additional information.

