Does Keeping Your Resume Up to Date Count as a Valid HA and DR Strategy?

August 25, 2010

By Gil Hecht, Founder and CEO, Continuity Software

Again and again, we are reminded of just how business-critical high availability (HA) and disaster recovery (DR) capabilities are in today’s highly competitive, ultra-demanding economy. Yet an unreasonably high percentage of today’s best-known and most respected business organizations are leaving themselves vulnerable to both natural and manmade IT disasters.

One needs only to review recent headlines to see what I mean: for instance, the 7-hour outage at Singapore’s largest banking network (“Global CIO: IBM’s Bank Outage: Anatomy of a Disaster”) and the 8-day-long American Eagle Outfitters disaster (“Oracle Backup Failure Major Factor in American Eagle 8-Day Crash”).

Clearly, HA and DR are a persistent challenge for many data centers, regardless of industry or size. And while most data centers have implemented an HA and/or DR strategy, most also understand there is no guarantee it will actually deliver. Due to the time and expense involved, the strategy usually gets tested only once or twice a year. Then, over the following days, weeks and months, changes are made to the production environment that are not replicated appropriately, and the HA/DR strategy is rendered virtually useless.

On a daily basis, I meet with many extremely experienced and talented data center managers to talk about how to ensure their organization’s HA and DR. Many privately admit that while HA and DR are a high business priority, from the standpoint of both internal governance and external legal regulations, they recognize that if they were to experience a true disaster, data and application availability would probably be lost for an amount of time that far exceeds SLA guidelines (if not permanently). In fact, one IT executive joked that his DR strategy was to “keep my resume up to date.”

OK, just for the sake of argument… How about an affordable and easy-to-manage solution that mitigates data protection and high availability risks by detecting gaps and vulnerabilities between your primary production, HA cluster and/or remote DR sites?


Cutting HA/DR costs with analytics

July 12, 2010

Availability, recoverability and data protection are critical to any enterprise. The alternative is an unacceptable loss of capital and reputation. Thus, significant time, resources and money are allocated to ensure that all business lines are highly available and can be recovered under various circumstances (from accidental file deletion to an earthquake). However, the datacenter is constantly changing and, despite the huge effort and world-class IT experts, new risks emerge on a regular basis. Traditional mitigation approaches fall short and regularly result in downtime, potential data loss and excessive operational effort, because:
• Discrete data is gathered by discrete systems, but none correlates all required layers: storage, OS, Database, replication, clustering, etc.
• Home-grown data collection and correlation is economically irrational

Using an HA/DR analytics solution, organizations can, on the one hand, dramatically increase availability and recoverability levels and, on the other hand, save significant time and money. An HA/DR analytics solution such as RecoverGuard by Continuity Software or DRA by Symantec analyzes thousands of potential risks by correlating the configuration of applications, databases, file systems, servers, storage, replication, clustering and everything in between. It is updated regularly, and many like to think of it as an “anti-virus for HA/DR”.

So how can an HA/DR analytics solution help cut down HA/DR costs? It’s simple, really. Most enterprises are overspending in the following three direct-cost areas:
• Cost of avoidable downtime
• DR testing operations expense
• HA/DR related sub-optimal resource utilization

Let’s explore each of these cost areas.
Cost of avoidable downtime. Unsuccessful cluster failover, single points of failure in the storage network/multipath, RAID level issues, risky layout of database files, suboptimal configuration of database vs. file systems vs. storage leading to unacceptable performance… all these and much more can be avoided by deploying an HA/DR analytics solution. If an hour of downtime costs 100K, a very serious cost reduction opportunity lies here.
DR testing operations expense. The organization can become aware of its DR readiness before actually performing the test and failing over, and execute a DR drill only after resolving known recoverability issues. By doing so, significant time and resources are spared. Furthermore, by identifying new threats on the spot, as they emerge, resolution time and the manpower involved are kept to a minimum. Last night’s changes are still fresh and identifying the root cause is easy, unlike when an error occurs in a yearly DR drill and no one remembers the specific change (one of many…) that was performed months before and created the error. Moreover, with the reporting features and in-depth visibility into dependencies between production and DR systems, replication, cluster configuration (and so on), a DR test requires fewer resources from the various IT teams and much less manual labor from the BCP personnel.
HA/DR related sub-optimal resource utilization. While it is not the main purpose of HA/DR analytics, the data gathered by such tools allows them to identify saving opportunities. Examples of such opportunities around storage are allocated but unused devices, old replicas, file systems or raw devices allocated to a database but hardly used, and so on. On the storage network side, replication bandwidth can be optimized by detecting excessive replication, swap replication, temp database replication and so on. Naturally, hidden saving opportunities exist in other layers as well.

HA/DR readiness verification is too complex a task to be performed manually without the right tools for the job (check out my “BCP is not different than other IT departments” post). Considering the different teams involved, the different layers, products and vendors, and the endless details embedded within each unique component, doing it manually is practically mission impossible. You know when a DR test starts but you don’t really know when it is going to end… and you don’t know whether data is recoverable at any given time or when the next downtime event will hit. Automation is the key to success and significant cost reduction. HA/DR analytics solutions can dramatically increase control over HA/DR readiness and at the same time reduce costs considerably.


BCP is not different than other IT departments

May 20, 2010

Among other things, BCP personnel bear the responsibility for recoverability in case of disaster. The BCP manager must verify that at any given time, data can be recovered and operations can be resumed successfully according to the policies (RPO, RTO) set by the organization. This is all fine and dandy in theory, but take a moment to think about it: how exactly will a BCP manager determine recoverability status at any given time?

At best, a DR test is performed every quarter. Suppose you were responsible for the IT recoverability of a large financial institution and that DR exercises are performed on January 1st, April 1st and so on. What would you tell the CIO should he ask you on February 15 whether IT is recoverable? Would you be able to answer confidently “Yes”? The honest and correct answer would be “Sir, I do not know. We fixed the glitches found in the Jan 1 DR test, so I guess IT was recoverable back then… But now, I cannot say for sure. Probably not.” It gets worse, right? Sure it does. Further deliberation will expose other weak spots in DR testing that we all experience: the DR test included only a small portion of IT… not all critical systems… production wasn’t really shut down or cut off during the test… it didn’t really simulate end users (or load scenarios)… I can go on and on. And so, the question remains: how will BCP evaluate readiness for DR at all times?

Let’s compare notes with other datacenter departments. How does Network Security know that the network is secured? How does a system administrator know that a server is malfunctioning? The answer is simple: they have visibility into their domain. In other words, they have the tools that allow them to explore their area of responsibility, get an up-to-date detailed status and receive automatic notification when something goes wrong. BCP, like any other IT department, must have the right tools for the job. Yet unlike system administrators, database admins and so on, BCP needs a management solution that provides visibility into all IT layers and not just into servers or just into database configurations. Furthermore, disaster recovery management (DRM) solutions must be capable of analyzing the dependencies between the different layers and finding recovery vulnerabilities.

Can you imagine any organization with a 7+ figure IT budget not purchasing a server performance solution such as HP Performance Manager or IBM Tivoli Monitoring? Or network monitoring and event management solutions such as CA Spectrum/eHealth or HP NNM? Of course not, because it’s clear that datacenter monitoring requires automation (it is too complex a task to be performed continuously and accurately by human beings) and that without automated monitoring, suboptimal operation and downtime are unavoidable. BCP/recovery management is no different. Without a DRM solution, BCP personnel are “blind” and unaware of datacenter status in terms of readiness for recovery. They must put their trust and faith in the hands of IT teams whose first priority is production. Those teams might be kind enough to share some technical details with the BCP team… but a working datacenter is not based on mere kindness and “favors” but on intelligent processes which lead to efficient, goal-oriented teamwork.

The good news is that high-end DRM solutions have emerged in the last few years, giving BCP personnel just the tools they were missing. Products such as RecoverGuard by Continuity Software and Disaster Recovery Advisor by Symantec provide BCP staff with a real-time business-oriented status of readiness for disaster (including both HA and DR). These analytics tools automatically identify hidden HA/DR risks and let the user know about them as soon as they happen. They also let the users explore the different IT layers, understand dependencies between production databases, servers, storage, remote storage, DR server and so on. If you are thinking about deploying a DRM solution, note that Continuity Software offers a risk free 48-Hour RecoverGuard pilot.

To guarantee successful recovery 365 days a year, BCP/recovery personnel must have solutions that provide visibility all across the datacenter, and that automatically and continuously perform datacenter configuration analysis to ensure no recovery gaps and vulnerabilities exist. Such DRM solutions have grown in the past few years to be an integral part of every large IT organization, as it became apparent that with such solutions, significant downtime and loss of data can be avoided.

I, for one, believe this is a major milestone in the never-ending struggle to control HA/DR.


The importance of creating point-in-time copies on a regular basis

April 5, 2010

Today I’d like to take the time to outline the differences between continuous replication (often referred to as “synchronous” or “asynchronous”) and periodic replication, or point-in-time (PiT) copies.

When referring to “continuous replication”, I include any type of replication which maintains a target copy synchronized with its source on an on-going basis. This category may include:

  • Synchronous replication – Target is identical to the source at any time (no write is acknowledged on the source until it has also been performed on the target). Examples of synchronous replication are EMC SRDF/S, HDS TrueCopy and HP Continuous Access Synchronous.
  • Asynchronous replication – Similar to synchronous replication, target is continuously being synchronized with the source; however it may have a lag of several minutes or less. Examples of asynchronous replication would be EMC SRDF/A, HUR (Hitachi Universal Replicator) and IBM Global Mirror.

Point-in-time (PiT) copies, on the other hand, are not continuously synchronized. These copies are synchronized with their source at specific times, and once they reach full synchronization, the process is stopped. Note that this definition covers both copies kept within the same storage frame (such as EMC TimeFinder, HDS ShadowImage, NetApp snapshots, IBM FlashCopy) and copies kept on a different frame (such as NetApp SnapMirror, timed synch-split SRDF, HUR copies, etc.).

 Why do we need a/synchronous replication? Why do we need PiT copies? Do we need both or can we choose only one of the two types? What are the benefits and weak spots of each method?

When it comes to disaster recovery, the different scenarios can be divided into two groups:

  • Physical risks. This group includes hardware failure, outage, natural disaster and so on.
  • Logical risks. This group includes accidental data deletion or corruption by users or applications, software error that harms data integrity, viruses, etc.

 The short answer is:

  • You’d need both continuous replication and PiT copies to ensure successful recovery from both the physical and logical risk scenarios.
  • A/Synchronous replication is mainly aimed at dealing with the “physical risks” group.
  • PiT copies solve the threats associated with the “logical risks” group.

In more detail: when an outage or any kind of physical error occurs, continuous replication will allow you to recover with the least amount of data loss (or with no data loss at all). However, when (for example) a file is deleted accidentally, it gets simultaneously removed from your synchronized copy as well! Thus, recovery with continuous replication is doomed to fail. In order to recover, you’ll need a saved copy of the file from a time before its deletion. In other words, a point-in-time copy. The file can be copied from the PiT copy to the fully synchronized target, thus resulting in an up-to-date and valid copy of the source, and operations can continue. Moreover, when an unknown set of files has been compromised, the organization may choose to recover directly from the PiT copy.
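The difference between the two approaches can be illustrated with a toy model. The sketch below is purely illustrative: the `Volume` class and its methods are hypothetical and do not correspond to any vendor's replication API. It models a source volume, a continuously synchronized replica and periodic PiT copies as plain dictionaries, and shows that an accidental deletion reaches the replica but not an earlier PiT copy.

```python
import copy


class Volume:
    """Toy model of a source volume with a continuous replica and PiT copies."""

    def __init__(self):
        self.source = {}       # file name -> contents
        self.replica = {}      # continuously synchronized target
        self.pit_copies = []   # frozen point-in-time copies, oldest first

    def write(self, name, data):
        self.source[name] = data
        self.replica[name] = data  # continuous replication mirrors every write

    def delete(self, name):
        self.source.pop(name, None)
        self.replica.pop(name, None)  # the deletion is mirrored as well

    def take_pit_copy(self):
        self.pit_copies.append(copy.deepcopy(self.source))

    def restore_from_pit(self, name):
        """Copy a file back from the most recent PiT copy that still holds it."""
        for snapshot in reversed(self.pit_copies):
            if name in snapshot:
                self.write(name, snapshot[name])
                return True
        return False


vol = Volume()
vol.write("orders.db", "v1")
vol.take_pit_copy()
vol.delete("orders.db")  # accidental deletion (a logical risk)

lost_on_replica = "orders.db" not in vol.replica  # the replica cannot help
restored = vol.restore_from_pit("orders.db")      # the PiT copy can
print(lost_on_replica, restored)
```

In this model the restore writes the file back through the normal write path, so the replica becomes valid again as well, mirroring the recovery flow described above.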

Now I know what you’re thinking: “Hey! I can do the same thing with my tape backups. I don’t need PiT copies”. That is true, but only to some extent. There are many advantages to having PiT copies on top of backup, such as:

  • PiT copies are significantly more available – can be used immediately. Moreover, you can have PiT copies in every site.
  • You can create PiT copies frequently – as much as needed (every couple of hours)
  • Retrieving files from backup is painfully slow (hours to days) – hardly enough to meet the (enterprise level) recovery time objectives (RTO)
  • In addition, there are obvious benefits to having PiT copies and taking backup of them instead of directly backing up production…but that’s a whole different topic.

Any organization with strict RPO and RTO policies would be wise to maintain both continuous and PiT copies. Any other architecture has weak spots that may end up in loss of data or prolonged downtime in case of failover.


About virtualization and disaster recovery

March 29, 2010

Virtualization and solutions such as VMware SRM certainly offer great advantages for DR testing. The assurance that the production and DR servers are 100% identical is a most appealing feature of virtualization. Other benefits, such as the ability to run production and DR in parallel, the ability to run a DR exercise whenever you want, and the simplicity of virtualized servers, also bring progress to the field of disaster recovery testing. However, even with virtualization, successful recovery is far from being a slam dunk. In the next paragraphs I’ll try to outline a few of the challenges around recovery in a virtualized environment.

Configuration errors and best practice violations may still render your replicated virtual machines/data corrupt and/or inconsistent, and thus unrecoverable. Very much like in the physical (pre-virtualization) world, if you do not follow the rules and diligently make sure that implementation and day-to-day changes meet the guidelines of the different vendors, your recovery will be at risk. For example, a point-in-time copy (created with EMC TimeFinder, HP StorageWorks XP Business Copy or the like) taken while the source virtual machine was not shut down or suspended is at high risk of being inconsistent. Some of you may recall similar concepts for creating consistent images for databases such as Oracle (cold/hot backup), UDB (I/O suspension) and other DBMSs. Of course, this is just one example from a long list of prerequisites, guidelines and recommendations, and each vendor has its own list. On top of that, there are specific cross-vendor guidelines (e.g. NetApp and VMware, Hitachi and Hyper-V, etc.), but we’ll get to that later.

A DR exercise is still a complex operation. Yes, with tools such as VMware SRM, in theory a DR test is just a few clicks away. In reality, there remain many challenges that prevent frequent DR testing, such as:

Complete Prod/DR separation is difficult and mistakes gravely affect production

  • Network collisions, conflicts, etc.
  • Dependencies on physical elements (a file server or other sensitive un-virtualized application that at the same time interacts with virtualized components)

DR testing is more than just failing over – it’s a complex operation

  • Different teams must verify storage, servers, databases and applications are functioning properly
  • Real-life scenarios must be simulated, including peak load scenarios
  • Workstations must be manned with end users
  • It takes time
    • Problems must be resolved
    • Processes must be coordinated
  • It’s not enough to bring everything online; reasonable performance must also be tested and assured (see example ahead)

Manpower – as implied by the previous bullet – a DR exercise requires dedicated cross-domain human resources

  • BCP personnel, Project managers, IT managers,…
  • Storage administrators
  • Unix and Windows system administrators
  • Network administrators
  • Security personnel
  • Oracle DBAs, MS-SQL DBAs, …
  • Application owners – WebSphere, BEA, Exchange, Lotus,…
  • End users

 

Dependencies and overlap areas between different domains and areas of responsibility create vulnerabilities which jeopardize the ability to recover successfully. Virtual machines depend on correct storage and replication configuration. For example, a reduced RAID level configuration puts your VM at risk. Furthermore, if VMFS is only partially replicated to the remote site or if consistency groups do not include all required resources, data will be lost upon disaster. There are plenty of other examples of what could go wrong in the VMware-storage overlap (is your remote ESX configured with the same multipath level as the production ESX? Same load balancing algorithm? Queue depth? I can go on and on). Other dependencies exist between databases and VMware: depending on your required level of recovery assurance, you may need to put the database in backup mode (Oracle lingo) while creating VMware or storage snapshots (unless you’re willing to settle for recovery-not-guaranteed crash-consistent copies…). Another overlap area is between virtualized and un-virtualized environments. In the real world, not all assets are virtualized. Those assets interact with virtualized components. Hence, all the complexities of the physical disaster recovery drill still exist (some may say that having to deal with two types of environments creates an even greater challenge). Examples of such virtual-to-non-virtual relationships are:

  • Virtualized client accessing non-virtualized NFS/CIFS file server
  • Database on VM interacts (via DB links for example) with database on a physical server
  • Virtualized business line relies on data from other non-virtualized business line – or vice versa
  • Virtualized clients accessing non-virtualized applications (Exchange, Lotus, etc.)
  • Physical domain / DNS servers serving virtualized environments
  • And so on…
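The production-versus-DR questions raised above (multipath level, load balancing algorithm, queue depth) lend themselves to an automated comparison. The following is a minimal sketch, assuming the relevant settings have already been collected from each host into dictionaries. The setting names and values are hypothetical placeholders, not actual VMware parameter names, and the collection step itself is out of scope.

```python
def find_config_gaps(prod, dr, keys):
    """Return {setting: (prod_value, dr_value)} for every setting that differs."""
    gaps = {}
    for key in keys:
        if prod.get(key) != dr.get(key):
            gaps[key] = (prod.get(key), dr.get(key))
    return gaps


# Hypothetical setting names and values, for illustration only.
CHECKED_SETTINGS = ["multipath_paths", "load_balance_policy", "queue_depth"]

prod_host = {"multipath_paths": 4, "load_balance_policy": "round_robin", "queue_depth": 64}
dr_host = {"multipath_paths": 2, "load_balance_policy": "round_robin", "queue_depth": 32}

gaps = find_config_gaps(prod_host, dr_host, CHECKED_SETTINGS)
for setting, (prod_value, dr_value) in sorted(gaps.items()):
    print(f"{setting}: production={prod_value}, DR={dr_value}")
```

A setting missing on one side shows up as `None` in the reported pair, so an absent value is flagged rather than silently ignored.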

 

To sum up: yes, virtualization is a game changer for Disaster Recovery Management (DRM). Nevertheless, many of the traditional BCP/DR challenges still exist, alongside several new challenges which have emerged as a result of using virtualization. Running a DR exercise is simple in theory but not in practice. To ensure successful recovery, an enterprise organization must invest significant time, money and human resources. Automation is the key. The use of HA/DR monitoring solutions such as Continuity Software’s RecoverGuard and Symantec’s Disaster Recovery Advisor (DRA) can give BCP and IT teams visibility into dependencies in virtualized and physical environments, as well as automatic detection of availability and recovery vulnerabilities.


SLA Management™ Now Available in RecoverGuard 5.0

February 3, 2010

The planning and execution of disaster recovery procedures often involves multiple teams within an organization, including business continuity managers, as well as storage, system/application, and other groups within the IT department.  However, poor communication and collaboration, as well as conflicting objectives, can create a disconnect between these various teams.  As a result, they often over- or under-provision the disaster recovery systems designed to ensure data protection and availability – causing either unacceptable exposure or significant waste of resources when an event occurs.

Our new Service Level Agreement (SLA) Management module, available with RecoverGuard 5.0, can help organizations overcome these challenges, providing them with a robust solution that ensures sufficient protection at all times, while eliminating the wasted money or staff time associated with over-provisioning.  Those responsible for business continuity and data protection will gain greater control over and visibility into the storage resources they dedicate for disaster recovery purposes.

With a broad range of powerful capabilities, our SLA Management solution can enable companies to avoid the problems typical of disaster recovery procedures, such as lack of redundancy, too few copies of critical data, and over-inflated costs.  This empowers them to more effectively and economically optimize storage allocation and utilization to meet application performance goals and requested service levels.

Key features of our SLA Management module include:

  • An intuitive, business-oriented SLA definition builder that makes it easy for even those users with little or no technical savvy to define levels of service, and associate them with services, servers, databases, and other technology assets.  For example, users can define how often remote copies are refreshed, the type of storage to be used, how long local copies will be retained, or the level of redundancy (if any).
  • Comprehensive reporting that allows IT staff to assess existing service levels, and compare them to policies and guidelines.
  • Real-time alerts that immediately notify stakeholders when deviations from SLA rules take place.
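To make the idea of a service level definition and deviation alerts concrete, here is a small sketch. It is not RecoverGuard's actual data model or rule format: the `ServiceLevel` and `AssetStatus` classes, their field names and the sample values are all assumptions made for illustration, loosely based on the examples above (remote copy refresh frequency and number of local copies).

```python
from dataclasses import dataclass


@dataclass
class ServiceLevel:
    name: str
    max_remote_copy_age_hours: float  # how often remote copies must be refreshed
    min_local_copies: int             # how many local copies must be retained


@dataclass
class AssetStatus:
    asset: str
    remote_copy_age_hours: float
    local_copies: int


def sla_deviations(level, status):
    """Return human-readable deviations of one asset from its service level."""
    deviations = []
    if status.remote_copy_age_hours > level.max_remote_copy_age_hours:
        deviations.append(
            f"{status.asset}: remote copy is {status.remote_copy_age_hours}h old, "
            f"{level.name} allows {level.max_remote_copy_age_hours}h"
        )
    if status.local_copies < level.min_local_copies:
        deviations.append(
            f"{status.asset}: {status.local_copies} local copies, "
            f"{level.name} requires {level.min_local_copies}"
        )
    return deviations


# Hypothetical service level and asset status.
gold = ServiceLevel("gold", max_remote_copy_age_hours=4, min_local_copies=2)
billing_db = AssetStatus("billing-db", remote_copy_age_hours=6, local_copies=2)

alerts = sla_deviations(gold, billing_db)
for alert in alerts:
    print(alert)
```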

 

Visit our Web site to learn more about our new SLA Management module, and how it can help your business achieve maximum data protection, in the most efficient and cost-effective manner possible.


A common risk identified in remote mirroring configuration

January 10, 2010

Remember our LVM mirroring article from a few weeks ago? This time I’d like to take a closer look at one of the potential risks that were described.

The risk signature:

Incorrect Mirror Configuration for DR


The impact

In any event which requires recovering data from the DR site:

  • Recovery will not be possible
  • Data will be lost
  • RPO SLA will be breached
  • Extended downtime and RTO SLA violation

Technical details

In this scenario, the customer is using mirroring at the LVM (Logical Volume Management) level to create a synchronous copy of the database at the DR site.

The source data is stored on SAN volumes located at the production site, while the mirror is supposed to be stored on SAN volumes at the DR site. However, the configuration is erroneous, since the mirrored data is partially stored on volumes from the production SAN array (see Image 1: Incorrect Mirror Configuration). In the event of a disaster, no complete copy of the database will be available at the DR site. Recovery will not be possible. The database will have to be recovered from a recent backup, a process which involves loss of data and an RPO violation and, due to the nature of recovering from tape, prolonged downtime and an RTO SLA violation.
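The check behind this risk signature can be sketched in a few lines. This is a simplified illustration, not how RecoverGuard implements it: it assumes a mapping from each mirror volume to the array that hosts it is already available, and the volume and array names are made up.

```python
def misplaced_mirror_volumes(mirror_volumes, dr_array):
    """Return the mirror volumes that are not located on the DR array."""
    return sorted(vol for vol, array in mirror_volumes.items() if array != dr_array)


# Hypothetical layout: one mirror volume was carved from the production array.
mirror_layout = {
    "mirror_vol_1": "DR-ARRAY",
    "mirror_vol_2": "DR-ARRAY",
    "mirror_vol_3": "PROD-ARRAY",
}

misplaced = misplaced_mirror_volumes(mirror_layout, "DR-ARRAY")
if misplaced:
    print("Incomplete DR mirror, volumes on the wrong array:", ", ".join(misplaced))
```

A single misplaced volume is enough to make the DR copy incomplete, which is why the check reports any non-empty result rather than a percentage.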

Can it happen to me?

Yes, for various reasons.

First, configuration errors are inevitable in the enterprise datacenter environment, which involves thousands of configuration entities such as arrays, disks, physical volumes, logical volumes and so on…

Moreover, such a vulnerability would go unnoticed until recovery is needed, since the mirrored copy is not put to use on a regular basis. Last, configuration drift accumulates over time. Even if the environment was set up correctly in the past, any change applied may endanger the validity of the DR solution. For instance, expanding the database to new file systems and/or SAN volumes may break the DR mirror if the implementer does not take into account the intentions of the original design and its complexity.

Image 1: Incorrect Mirror Configuration

Think your data centers may have hidden recoverability and downtime risks such as this?  Find out with the risk free 48-hour RecoverGuard pilot scan.


Why Every Disaster Recovery Plan Must Include RPO

December 21, 2009

In our last post, we provided an overview of recovery point objective (RPO), a critical disaster recovery metric that defines the level of data loss a company is willing to tolerate when an outage takes place.  In this entry, we’ll discuss what makes RPO so crucial when it comes to facilitating proactive disaster recovery planning strategies.

Today’s companies run on information.  Corporate data is leveraged each and every day during automated business transactions and in support of strategic planning and decision-making by executives and managers.  And in many cases, data is made available to customers to enhance service and satisfaction, or to external business and supply chain partners.  So when information is lost as the result of a technical failure or system outage, mission-critical operations can be severely affected.

The level of impact an organization will feel when a disaster strikes will depend greatly on the type of information lost and its primary use.  Data can be classified in a variety of ways – there is data needed for revenue generation, data that enhances the customer experience, data that facilitates internal productivity, etc.  Those who are defining recovery plans need to take these data classes into account, and set an appropriate RPO for each. For example, data that is required for sales and revenue generation, such as inventory information that lets customers know if a product is in stock before they place an order over the Web, would need a shorter, more rigid RPO than data utilized in non-critical internal processes.
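As a rough illustration of setting an RPO per data class and measuring against it, consider the sketch below. The class names and RPO values are invented for the example and are not recommendations. The data-loss exposure of a dataset is approximated as the time elapsed since its last good copy.

```python
from datetime import datetime, timedelta

# Hypothetical RPO per data class; real values come from business requirements.
RPO_BY_CLASS = {
    "revenue": timedelta(minutes=5),
    "customer_experience": timedelta(hours=1),
    "internal": timedelta(hours=24),
}


def rpo_violations(datasets, now):
    """datasets maps name -> (data_class, time of last good copy).

    Returns {name: exposure} for datasets whose exposure exceeds their RPO.
    """
    violations = {}
    for name, (data_class, last_copy) in datasets.items():
        exposure = now - last_copy
        if exposure > RPO_BY_CLASS[data_class]:
            violations[name] = exposure
    return violations


now = datetime(2009, 12, 21, 12, 0)
datasets = {
    "web_inventory": ("revenue", now - timedelta(minutes=20)),
    "hr_wiki": ("internal", now - timedelta(hours=3)),
}

violations = rpo_violations(datasets, now)
for name, exposure in violations.items():
    print(f"{name}: last copy is {exposure} old, exceeding its RPO")
```

Here the three-hour-old copy of the internal dataset is acceptable, while the twenty-minute-old copy of the revenue dataset is not, which is the asymmetry the classification is meant to capture.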

By taking this approach, companies can better design, budget for, and implement the optimal IT solution to ensure that all RPOs are met.  Additionally, they can more effectively communicate disaster recovery plans to all internal and external stakeholders, to maximize preparedness if and when an outage does occur.

Those companies who don’t set RPOs for the various types of data they maintain – and the systems that house them – might find that their disaster recovery plans fail in the event of an emergency.  Without setting and measuring RPOs, it can be quite difficult to properly define disaster recovery processes, or to clearly articulate how those processes should be carried out.  As a result, organizations may:

  • Face unnecessary risks due to “gaps” in their plan’s coverage.  This can lead to unacceptable data loss, which ultimately translates to lost revenue, lack of regulatory compliance, customer churn, or damage to brand image and reputation.
  • Hinder the efficiency and effectiveness of disaster recovery procedures.  Without clear instructions, IT teams are likely to either over-provision (wasting valuable human and financial resources) or under-provision (leaving important data unprotected from loss) their environments. In our many years of experience helping companies evaluate the technical validity of their disaster recovery infrastructures, this is one of the issues we see most frequently.

And, perhaps most importantly, remember that defining an RPO is not a one-time event.  It is an ongoing process that must be flexible, leaving room for re-evaluation and refinement as business needs, technology environments, and other internal and external factors change.

Visit our Web site to find out more about RPO, its vital role in the disaster recovery process, and RecoverGuard, our robust solution for enabling rapid, accurate RPO measurement.


Using LVM mirroring for disaster recovery?

December 20, 2009

One of the disaster recovery solutions in use by many organizations is LVM mirroring and snapshots. Some of you may raise an eyebrow reading this, thinking “LVM mirroring? Over tens of kilometers?”. The answer is “Yes”. While traditionally LVM mirroring was used locally in order to keep data highly available and to create copies of logical volumes, today more and more organizations choose LVM mirroring as the solution to keep a synchronized copy of the local data at the remote site. So in fact, instead of using storage-based replication technologies, data is copied at the host level. Of course, to ensure high resiliency, reliability, availability and performance, data is still stored on SAN arrays such as EMC DMX, HP XP, IBM DS and so on. I am guessing this approach is becoming more and more popular since LVM mirroring is typically free or part of basic already-purchased software packages, while remote replication software usually requires a separate, sometimes costly, license.

Unfortunately, LVM mirroring for disaster recovery is no less complex than storage replication. Many of the risks associated with storage replication are still relevant in the LVM mirroring scenario. Moreover, LVM mirroring introduces a few new risks that do not exist in storage replication.

Here are a few examples of configuration errors that are often created over time and lead to data loss and extended recovery time in case of disaster recovery:

#1: Incorrect mirror configuration

The file system is striped and the source data is stored on several SAN volumes on the local SAN array. The DR mirror is stored on several SAN volumes; however, one of these volumes is from a local SAN array. Dear oh dear… in case of disaster, this would result in complete data loss. Data would have to be restored from backup, resulting in RPO and RTO violations. This is a very common risk signature that is detected by RecoverGuard in datacenters implementing LVM mirroring and snapshots.

#2: Missing mirrors

When using synchronous storage replication, replication pairs between the local and the remote array are often pre-configured. Hence, when a new storage volume is allocated and used on the production site, it is already being replicated and protected. With LVM mirroring, this is not the case. The administrator must keep track of any new logical volume and create the mirror every time. Some changes slip through and the result is un-mirrored, unrecoverable production data. In addition, sometimes the administrator knowingly doesn’t create a mirror for a logical volume because it is currently unused. However, at a certain point the logical volume is put to use, and the administrator is either not aware of that or was not notified of the change (which reminds me of the post I’ve made regarding IT team coordination… check it out). The outcome is yet again complete data loss upon disaster.
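Both variants of the missing-mirror problem can be caught by a periodic scan of the logical volume inventory. The sketch below assumes such an inventory is already available as a dictionary. The field names (`in_use`, `mirror_copies`) and the volume names are hypothetical and do not reflect the output of any particular LVM tool.

```python
def find_unprotected_volumes(volumes):
    """volumes maps name -> {"in_use": bool, "mirror_copies": int}.

    Returns (critical, warnings): in-use volumes without a mirror, and
    currently unused volumes without a mirror that may be put to use later.
    """
    critical, warnings = [], []
    for name, info in sorted(volumes.items()):
        if info["mirror_copies"] > 0:
            continue  # volume has at least one mirror copy
        if info["in_use"]:
            critical.append(name)
        else:
            warnings.append(name)
    return critical, warnings


# Hypothetical inventory of logical volumes on one production server.
inventory = {
    "lv_data01": {"in_use": True, "mirror_copies": 1},
    "lv_data02": {"in_use": True, "mirror_copies": 0},   # new volume, mirror forgotten
    "lv_spare": {"in_use": False, "mirror_copies": 0},   # unused today
}

critical, warnings = find_unprotected_volumes(inventory)
print("Unmirrored production data:", critical)
print("Unmirrored but unused (watch for changes):", warnings)
```

Reporting unused volumes separately keeps the second scenario visible: the day `lv_spare` is put to use, it moves from the warning list to the critical list on the next scan.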

#3: Not built for a large-scale datacenter

One of the greatest drawbacks of LVM mirroring and snapshots is that they were never designed to be used on a large scale. As a result, there are no management tools that allow you to enforce policies, act on groups and so on. Sample weak spots in this category include:
No federated consistency. With storage replication, one may create a disk group that includes many servers and ensures I/O consistency across them (but not necessarily application-level consistency). With LVM mirroring, this is not an option. Consistency is only guaranteed within a server.
Difficult to manage PiT copies. Establishing point-in-time copies (snapshots) requires heavy scripting, which in turn requires significant maintenance and care. Moreover, a large number of snapshots may have a grave impact on server performance.

#4: No true async mode

Most LVM software does not offer an asynchronous mirroring solution. Products that do usually rely on opportunistic/dirty mirroring and do not commit to any SLA. Moreover, some of these solutions require significantly more storage (cascading async mirroring on top of short-distance sync mirroring, e.g. a “bunker”). In today’s world, a good and reliable asynchronous data copy/replication tool is important. Due to performance considerations, synchronous replication is sometimes not an option, and by nature synchronous mirroring/replication is only applicable over a short distance. LVM mirroring may be even more sensitive to distance, since the source server needs access to the remote SAN array.

Other honorable mentions include site tagging management, the complexities of root volume mirroring (boot from SAN) and the ability to mirror across incompatible storage tiers (while some people consider this an advantage, it may lead to performance loss).

Conclusions?

LVM mirroring is not a bad choice for a small datacenter or a specific business service with no strict performance requirements. However, take particular care to monitor and maintain the mirroring and snapshot configuration. Since every logical volume is managed separately, configuration drift is likely to occur, which will lead to loss of data and extended recovery time in the event of a disaster. Change management is important even in medium-size environments, and all the more so in larger datacenters. Monitoring and risk analysis tools such as RecoverGuard can help you detect configuration errors, such as those mentioned above, as they occur. If you find this interesting, visit our website for additional information.


What is RPO?

November 29, 2009

There are many important elements within any business continuity strategy, but the majority of experts will argue that recovery point objective (RPO) is one of the most vital components.  When developing disaster recovery plans, this important metric, which indicates the level of data loss (measured in time) that a company is willing to accept when disaster strikes, must be included.

More specifically, RPO is the maximum acceptable number of hours of lost data in case of a critical event. For example, if the RPO for an accounting system is four hours, then IT teams will work to bring the application data to the same state it was in no more than four hours before the outage took place.  Any information generated or modified during that time will either be deemed irretrievable, or will need to be re-entered.
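The four-hour example can be expressed as a simple comparison between the age of the most recent recoverable copy and the RPO. The sketch below is illustrative only; the function names and timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def data_loss_window(last_recovery_point, outage_time):
    """Time span of data that would be lost if we recover to the last copy."""
    return outage_time - last_recovery_point


def meets_rpo(last_recovery_point, outage_time, rpo):
    """True if the data lost on recovery stays within the RPO."""
    return data_loss_window(last_recovery_point, outage_time) <= rpo


rpo = timedelta(hours=4)                    # accounting system from the example
outage = datetime(2009, 11, 29, 14, 0)      # hypothetical outage time

recent_copy = datetime(2009, 11, 29, 11, 30)  # 2.5 hours before the outage
stale_copy = datetime(2009, 11, 29, 8, 0)     # 6 hours before the outage

print(meets_rpo(recent_copy, outage, rpo))  # within the four-hour RPO
print(meets_rpo(stale_copy, outage, rpo))   # exceeds the four-hour RPO
```

The same comparison works for any protection method, as long as you know when the latest usable copy was taken.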

In a recent teleconference on benchmarking disaster recovery management readiness, leading industry analyst firm Gartner indicated that the definition, documentation, and updating of RPO requirements for production applications were needed steps “in order to improve disaster recovery predictability, effectiveness, and efficiency”.

RPO should be based on many factors, including the nature and importance of the business process and related systems impacted.  For example, a company may set a stringent (low) RPO for customer relationship management (CRM) applications that facilitate mission-critical sales and service activities, and a less demanding (higher) one for less crucial applications, like inventory management.  Other factors are the human resources required to support recovery efforts, and the IT budgets available to cover associated costs.

While an RPO of zero hours (meaning no lost data) may sound ideal, for most businesses that goal is both unrealistic and cost-prohibitive. And in many cases, particularly for systems that process a low volume of transactions or support non-critical activities, an RPO of zero is simply unnecessary. The goal of RPO should be to balance cost with protection level. Once RPO is determined, IT departments can then implement the appropriate protection measures, such as backups, snapshots, and replication, based on the RPO for each system.

Visit our Web site to find out more about RPO, and to learn about RecoverGuard, our robust solution that enables precise RPO measurement.