Quick Guidebook: Top 10 Private Cloud Risks

March 10, 2011

Enterprises routinely build Disaster Recovery and High Availability measures into their private cloud infrastructure, so why do downtime and data loss risks still exist?

The reality is that even the most robust Disaster Recovery and High Availability (DR/HA) plans are only as good as your ability to test them.

To help you out, we have assembled a community-driven database of over 4,000 issues that pose downtime and data loss risks. While we would love to share them all, you can start with a peek at ten top risks in the private cloud environment.

http://www.continuitysoftware.com/downloads/Top10PrivateCloudRisks.pdf

Yahav Adorian
Continuity Software


Why Every Disaster Recovery Plan Must Include RPO

December 21, 2009

In our last post, we provided an overview of recovery point objective (RPO), a critical disaster recovery metric that defines the level of data loss a company is willing to tolerate when an outage takes place.  In this entry, we’ll discuss what makes RPO so crucial when it comes to facilitating proactive disaster recovery planning strategies.

Today’s companies run on information.  Corporate data is leveraged each and every day during automated business transactions and in support of strategic planning and decision-making by executives and managers.  And in many cases, data is made available to customers to enhance service and satisfaction, or to external business and supply chain partners.  So when information is lost as the result of a technical failure or system outage, mission-critical operations can be severely affected.

The level of impact an organization will feel when a disaster strikes will depend greatly on the type of information lost and its primary use.  Data can be classified in a variety of ways – there is data needed for revenue generation, data that enhances the customer experience, data that facilitates internal productivity, etc.  Those who are defining recovery plans need to take these data classes into account, and set an appropriate RPO for each. For example, data that is required for sales and revenue generation, such as inventory information that lets customers know if a product is in stock before they place an order over the Web, would need a shorter, more rigid RPO than data utilized in non-critical internal processes.

By taking this approach, companies can better design, budget for, and implement the optimal IT solution to ensure that all RPOs are met.  Additionally, they can more effectively communicate disaster recovery plans to all internal and external stakeholders, to maximize preparedness if and when an outage does occur.

Those companies who don’t set RPOs for the various types of data they maintain – and the systems that house them – might find that their disaster recovery plans fail in the event of an emergency.  Without setting and measuring RPOs, it can be quite difficult to properly define disaster recovery processes, or to clearly articulate how those processes should be carried out.  As a result, organizations may:

  • Face unnecessary risks due to “gaps” in their plan’s coverage.  This can lead to unacceptable data loss, which ultimately translates to lost revenue, lack of regulatory compliance, customer churn, or damage to brand image and reputation.
  • Hinder the efficiency and effectiveness of disaster recovery procedures.  Without clear instructions, IT teams are likely to either over-provision (wasting valuable human and financial resources) or under-provision (leaving important data unprotected from loss) their environments. In our many years of experience helping companies evaluate the technical validity of their disaster recovery infrastructures, this is one of the issues we see most frequently.

And, perhaps most importantly, remember that defining an RPO is not a one-time event.  It is an ongoing process that must be flexible, leaving room for re-evaluation and refinement as business needs, technology environments, and other internal and external factors change.

Visit our Web site to find out more about RPO, its vital role in the disaster recovery process, and RecoverGuard, our robust solution for enabling rapid, accurate RPO measurement.


What is RPO?

November 29, 2009

There are many important elements within any business continuity strategy, but the majority of experts will argue that recovery point objective (RPO) is one of the most vital components.  When developing disaster recovery plans, this important metric, which indicates the level of data loss (measured in time) that a company is willing to accept when disaster strikes, must be included.

More specifically, RPO is the maximum acceptable number of hours of lost data in case of a critical event. For example, if the RPO for an accounting system is four hours, then IT teams will work to bring the application data to the same state it was in no more than four hours before the outage took place.  Any information generated or modified during that time will either be deemed irretrievable, or will need to be re-entered.
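To make the definition concrete, here is a minimal Python sketch of the check it implies. The function names and timestamps are illustrative assumptions and do not come from any product. The sketch compares the age of the most recent recoverable copy against the RPO.

```python
from datetime import datetime, timedelta


def data_loss_window(last_recovery_point: datetime, outage_start: datetime) -> timedelta:
    """Span of data that is lost if we restore to the last recovery point."""
    return outage_start - last_recovery_point


def meets_rpo(last_recovery_point: datetime, outage_start: datetime, rpo: timedelta) -> bool:
    """True when the data loss window does not exceed the RPO."""
    return data_loss_window(last_recovery_point, outage_start) <= rpo


# Hypothetical accounting system with a four-hour RPO.
rpo = timedelta(hours=4)
outage = datetime(2009, 11, 29, 14, 0)
recent_copy = datetime(2009, 11, 29, 11, 30)  # 2.5 hours before the outage
stale_copy = datetime(2009, 11, 29, 8, 0)     # 6 hours before the outage

print(meets_rpo(recent_copy, outage, rpo))  # True
print(meets_rpo(stale_copy, outage, rpo))   # False
```

In the second case, six hours of data would have to be re-entered or written off, which exceeds the four-hour objective.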

In a recent teleconference on benchmarking disaster recovery management readiness, leading industry analyst firm Gartner indicated that the definition, documentation, and updating of RPO requirements for production applications were needed steps “in order to improve disaster recovery predictability, effectiveness, and efficiency”.

RPO should be based on many factors, including the nature and importance of the business process and related systems impacted.  For example, a company may set a stringent (low) RPO for customer relationship management (CRM) applications that facilitate mission-critical sales and service activities, and a less demanding (higher) one for less crucial applications, like inventory management.  Other factors are the human resources required to support recovery efforts, and the IT budgets available to cover associated costs.

While an RPO of zero hours (meaning, no lost data) may sound ideal, for most businesses that goal is both unrealistic and cost-prohibitive.  And, in many cases, particularly for systems that process a low volume of transactions or support non-critical activities, an RPO of zero is simply unnecessary.  The goal of RPO should be to balance cost with protection level.  Once RPO is determined, IT departments can then implement the appropriate protection measures, such as setting up backups, snapshots, and replication based on the RPO for each system.

Visit our Web site to find out more about RPO, and to learn about RecoverGuard, our robust solution that enables precise RPO measurement.


Note about New Software Release

August 3, 2009

By Gil Hecht
CEO

I thought I’d use this space to give you a quick update on the latest version of our RecoverGuard automated disaster recovery/high availability testing and monitoring software. As you know, RecoverGuard automatically scans your IT infrastructure to find hidden configuration errors or data protection gaps before they impact your operations.  Here’s a brief summary of the new functionality in Version 4.3:

  • Support for EMC CLARiiON platform, including MirrorView and SnapView. RecoverGuard also supports EMC Symmetrix, NetApp and HDS USP & AMS.  In an upcoming release we’ll expand this list further by adding support for HP XP and IBM DS.
  • Configuration testing support for all major cluster vendors, enabling you to ensure availability at all times
  • Significant advancements to the Availability Advisor, enabling RecoverGuard to scan for risks in kernel parameters, storage routing, domain and DNS settings, patches and service packs, and much more.

Of course, we continue to add new DR/HA risks to RecoverGuard’s robust gap detection knowledgebase, which currently holds over 3,000 potential known risks. We’ve posted some of the more common gaps on this blog, and will continue to do so as time goes on. You can see all of those we’ve posted by clicking on “Gap Analysis” under “Categories” in the upper right portion of this blog. If you’d like to see even more examples, you can see them on our website at http://www.continuitysoftware.com/commongaps


The Importance of IT Team Coordination in the Real World – DR Lessons Learned

June 10, 2009

by Yaniv Valik
SR DR Specialist, DR Assurance Group

This time I’d like to share with you a recent incident from one of our new customers. Before I dive into the details, let me briefly describe the environment. The data centers rely on HDS USP storage with mixed Sun Solaris, HP-UX, Linux and Windows servers. Ten percent of the environment is virtualized (VMware ESX). On the database side, Oracle and SQL Server are in use.

This is a 24/7 environment in which downtime is a disaster. The company considers high availability and disaster recovery mission critical. The IT staff is highly aware of change management in general, and specifically of how changes in production may impact high availability and readiness for disaster.

As for replication, the following process is implemented:
- Local ShadowImage copies are created twice a day and backed up to tape by Symantec NetBackup.
- Data is replicated synchronously with TrueCopy to a remote site. The TrueCopy (TC) replicas are mapped to the DR servers.
- Three point-in-time ShadowImage copies are taken at the remote site.

I believe every IT pro would classify this as an advanced and modern DR solution. And it is. Nevertheless, even in environments with heightened awareness of high availability and disaster recovery, gaps and configuration drifts are unavoidable. In this case, the local point-in-time ShadowImages taken for backup were not consistent! Data could not be restored from backup.

How did it happen?

As I mentioned before, this is a 24/7 environment. For this reason, a cold database backup is out of the question. Instead, they use hot backup, which keeps the database online and accessible. To assure image data consistency, a very delicate process must be implemented. Any deviation from this process may result in image consistency issues…which would render the backup unusable.

In essence, the process is:
1. Synchronize the replicas.
2. Verify that full synchronization was achieved.
3. Enter hot backup mode.
4. Split the replicas of data files.
5. Verify that split was completed successfully.
6. End hot backup.
7. Switch logs.
8. Split the replicas of log files… verify that split was completed successfully.

Additional steps may include creating copies of control files, enabling storage consistency solutions such as EMC / HDS Consistency Groups, etc.
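The ordering matters as much as the steps themselves, so here is a rough Python sketch of how the sequence above could be driven from a single place. Everything in it is an assumption for illustration: the `stub` function stands in for the real storage and database commands, which are not shown, and the step names simply mirror the list.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_hot_backup(steps: List[Step]) -> List[str]:
    """Run the steps strictly in order and stop at the first failure."""
    completed: List[str] = []
    for name, action in steps:
        if not action():
            raise RuntimeError(f"step failed: {name}; completed so far: {completed}")
        completed.append(name)
    return completed


def stub() -> bool:
    # Placeholder for a real storage or database command (assumption).
    return True


steps: List[Step] = [
    ("synchronize replicas", stub),
    ("verify full synchronization", stub),
    ("enter hot backup mode", stub),
    ("split data file replicas", stub),
    ("verify data file split", stub),
    ("end hot backup", stub),
    ("switch logs", stub),
    ("split log file replicas", stub),
    ("verify log file split", stub),
]

print(run_hot_backup(steps)[-1])  # verify log file split
```

Running every step from one driver means a failed verification halts the sequence before the next team's step can start, rather than relying on separate schedules lining up.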

Note that this process involves different silos, platforms and IT teams.

So how did it happen?

The timing of events was not fully synced.

The database entered hot backup “only” 2 minutes after the replica split had already been initiated. In this specific case, the issue was caused by the use of different schedulers by different teams (Control-M and crontab). Of course, there could be various other reasons (clocks not in sync, daylight saving time configuration differences, misunderstandings between IT teams, a change performed by one team that another team was unaware of …).
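A simple after-the-fact check can catch this kind of ordering problem. The Python sketch below is an illustration under stated assumptions: the timestamps are hypothetical, they are assumed to have been collected from the database and storage logs, and they are assumed to come from synchronized clocks. It verifies that the data file split happened entirely inside the hot backup window.

```python
from datetime import datetime


def split_inside_hot_backup(
    begin_backup: datetime,
    split_start: datetime,
    split_end: datetime,
    end_backup: datetime,
) -> bool:
    """True when the split started after hot backup began and finished before it ended."""
    return begin_backup <= split_start and split_end <= end_backup


# Hypothetical timeline resembling the incident: the split was
# initiated two minutes before the database entered hot backup.
begin_backup = datetime(2009, 6, 1, 2, 2)
end_backup = datetime(2009, 6, 1, 2, 10)
split_start = datetime(2009, 6, 1, 2, 0)
split_end = datetime(2009, 6, 1, 2, 5)

print(split_inside_hot_backup(begin_backup, split_start, split_end, end_backup))  # False
```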

Also, the customer uses Oracle ASM. This means a greater risk of data inconsistency in a hot backup scenario, since even without any client altering data, Oracle can perform automatic rebalancing while the point-in-time copies are being generated. (Rebalancing impacts replica data consistency similarly to performing database writes.)

The first scan by RecoverGuard exposed this vulnerability. The various IT teams were completely unaware of this situation and were amazed when it was discovered. The error was immediately rectified, along with other errors and improvement opportunities detected for Hitachi Dynamic Link Manager (HDLM, used for multi-pathing), Microsoft SQL and Veritas Cluster.

The Business Continuity/DR manager further explained that their quarterly DR tests only included verifying the TrueCopy replicas, and even then they perform a graceful shutdown on production, which simulates neither their routine backup/replication procedures nor a true disaster scenario.

Today’s data centers are just too complex to manage. Even with the best teams, configuration drifts and gaps are unavoidable in constantly changing data centers.


Interesting DR stories on Oracle blog post

April 30, 2009

by Yaniv Valik
SR DR Specialist, DR Assurance Group

I came across this post the other day when checking out Alejandro Vargas’ blog on the Oracle website:

http://blogs.oracle.com/AlejandroVargas/2008/02/disaster_recovery_stories.html

He recounts a couple of interesting disaster recovery cases that he was involved in, and explains what actions he took to recover some valuable data. It’s definitely worth taking a few minutes to read through it. And if you have some interesting DR stories (don’t we all?), I’d love to hear about them…the good, the bad, and the ugly!


Gap Analysis #4: Point-in-time copies never tested

March 19, 2009

by Yaniv Valik
SR DR Specialist, DR Assurance Group

So far in my Gap Analysis series we’ve covered replication inconsistencies, missing networking resources and tampering risk gaps. Today I’m going to take a look at what happens when point-in-time copies are never tested.

Gap: Point-in-time copies never tested

Risk: Data loss and increased time to recover

How does it happen? Point-in-time copies like snapshots and BCVs are the second line of defense to protect against human errors, viruses and outages as well. The DR configuration for applications typically includes:

  • Multiple local point-in-time copies such as EMC TimeFinder, HDS ShadowImage/Snapshot, NetApp FlexClone/Snapshot, or CLARiiON SnapView;
  • Remote synchronous replication such as EMC SRDF, Hitachi TrueCopy, CLARiiON MirrorView, and NetApp SnapMirror; and
  • Local point-in-time copies on the remote site.

In addition, the copies could be mapped to the target DR servers, configured with multi-path software such as EMC PowerPath, Veritas DMP and MPIO, and defined in logical volumes such as Veritas VxVM.

Point-in-time copies can easily become corrupt without being discovered, unless the application is fully started and the data integrity is thoroughly tested. There are numerous scenarios that can lead to such corruption, such as when the replica devices do not all belong to the same consistency group.
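The consistency group scenario lends itself to a simple configuration check. The Python sketch below is only an illustration: the device and group names are hypothetical, and the mapping is assumed to have been exported from the storage array beforehand. It does not replace starting the application and testing the data.

```python
from typing import Dict, Optional


def replicas_share_consistency_group(device_groups: Dict[str, Optional[str]]) -> bool:
    """True only when every replica device belongs to one and the same group."""
    groups = set(device_groups.values())
    return len(groups) == 1 and None not in groups


# Hypothetical replica devices for one application.
# None means the device is not in any consistency group.
replica_devices = {
    "dev_01": "cg_sales",
    "dev_02": "cg_sales",
    "dev_03": "cg_hr",
    "dev_04": None,
}

print(replicas_share_consistency_group(replica_devices))  # False
```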

What is the impact? The replica is corrupt and unusable. The file system will need to be recreated at the disaster recovery site and data restored from a recent backup, thereby increasing the time to recovery. All data created since the last backup will be lost. Corrupted file systems may still be usable in many cases, and only a close inspection of the content can reveal the fact that the data is meaningless.

Why does the DR test miss this? This gap can be missed if the specific business service is not tested for DR or if the DR test only includes turning on the DR server without actually running the applications.

If this is of interest to you, you can check out some other typical gaps on our website: http://www.continuitysoftware.com/commongaps


New release provides end-to-end solution

March 16, 2009

by Doron Pinhas
VP, Field Operations

All contributors to this blog try very hard not to hype our company’s products and services. Our posts are written to offer insights on a topic and technology we all know very well – DR and HA. However, today I’m going to break from that position just briefly. Continuity Software has announced RecoverGuard 4.0, a new version of its automated HA/DR testing and monitoring solution. I’d like to share it with you because it really does represent a major technological advancement.

With new support for clusters, root cause analysis, high availability gap detection and reporting, RecoverGuard is now a complete, end-to-end solution (from protection of data through availability) for ensuring business continuity.  This is significant because for the first time it will give you visibility into your HA infrastructure – which has never been possible before.

If you’d like to learn more about this important new release, you can:
  • Read the press release
  • Watch the on-demand webinar Solving HA/DR Configuration Drift
  • Check out what the press and analysts had to say


Gap Analysis #3: Tampering Risk

March 9, 2009

by Yaniv Valik
SR DR Specialist, DR Assurance Team

Continuing with my gap analysis series, in this post I’ll examine what causes a Tampering Risk to occur, and its impact on the business if the risk is not detected and resolved.  

Gap: Tampering Risk

Risk: DR failure and data corruption

How does it happen? This hidden risk is the result of an unauthorized host at the DR site erroneously configured with access to one or more storage devices. This is a very common error, and, much to the surprise of many organizations, there are dozens of reasons why it can happen. In each case, however, it can remain dormant during normal operations and is only revealed during an actual full-blown disaster. Here are just a few reasons why this error may occur:

  • When performing a storage migration, the storage administrator forgets to remove old device mappings to the host. After repurposing the old devices to new hosts, some are still visible by the original, now unauthorized host.
  • From time to time, extra mapping may be added to increase performance or resiliency of access to the disk. If zoning and masking are not controlled and managed from a central point, one of the paths might actually go “astray.”
  • Sometimes HBAs are replaced not because they are faulty but because greater bandwidth is required. If soft-zoning is used and is not updated accordingly, an old HBA still retains permission to access the original storage devices. Once the HBA is reused on a different host (which can occur months after the upgrade) this host will actually get access rights to the SAN devices which belong to the original host.
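Whatever the cause, the symptom is the same: a host can see a device it should not. As a rough illustration, the Python sketch below compares the mappings actually configured against the intended ones. The host and device names are hypothetical, and both dictionaries are assumed to have been gathered elsewhere (for example, from masking and zoning exports).

```python
from typing import Dict, Set


def unauthorized_mappings(
    actual: Dict[str, Set[str]],
    authorized: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Return, per device, the hosts that have access but are not authorized."""
    findings: Dict[str, Set[str]] = {}
    for device, hosts in actual.items():
        extra = hosts - authorized.get(device, set())
        if extra:
            findings[device] = extra
    return findings


# Hypothetical DR-site devices and the hosts that can currently see them.
actual = {
    "lun_10": {"dr_host_a", "old_host_b"},  # stale mapping left after a migration
    "lun_11": {"dr_host_a"},
}
authorized = {
    "lun_10": {"dr_host_a"},
    "lun_11": {"dr_host_a"},
}

print(unauthorized_mappings(actual, authorized))  # {'lun_10': {'old_host_b'}}
```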

What is the impact? During a disaster, a race condition, with several unpleasant scenarios, will develop:

  • Scenario 1—The unauthorized host might gain exclusive access to the erroneously mapped disk. In this case, the designated standby will be unable to mount and use the locked devices, and it could take some time to isolate and fix the problem. There is also the risk of the unauthorized host actually using the erroneously mapped disk, thereby corrupting the data and rendering recovery impossible.
  • Scenario 2—Both the standby and the unauthorized hosts get concurrent access to the disk. If the unauthorized host attempts to use the erroneously mapped disk, not only will the data be corrupted instantly, but the now-active standby may unexpectedly crash.

Why does the DR test miss this? Simply put, because all hosts are rarely brought up at the same time. As already explained, many organizations choose to test only one subset of the environment at a time. During a test, the original and unauthorized servers would not both be started at the same time, but in a real event they would be, and they would wreak havoc on the data.

If this is of interest to you, you can check out some other typical gaps on our website: http://www.continuitysoftware.com/commongaps


Gap Analysis #1: Replication Inconsistencies

February 21, 2009

by Yaniv Valik
SR DR Specialist, DR Assurance Team

Doron’s recent post about the different types of risks that occur when configuration gaps are created got me thinking that you might be interested in more details about individual gaps. In this and subsequent posts, I’ll give you a quick description of a gap we discover in many companies, explain why it occurs, and what it means to the business.

I’ll start with Replication Inconsistencies (Different RDF Groups).

What’s the risk?  Data loss and increased time to recover.

How does it happen? This is a common gap found in large EMC SRDF/S and SRDF/A environments where multiple RDF groups are needed. It occurs most often when storage volumes from different RDF groups are provisioned to the host and used by the same database. The provisioning tools do not alert or prevent this configuration. Each RDF group is associated with different replication adapters and potentially different network infrastructures. Rolling disaster scenarios can result in corrupted replicas at the disaster recovery site.
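Since the provisioning tools do not flag this configuration, a separate check is needed. The Python sketch below shows the idea under stated assumptions: the database, volume and RDF group names are hypothetical, and the two mappings are assumed to have been collected from the host and the storage array beforehand.

```python
from typing import Dict, List


def databases_spanning_groups(
    db_volumes: Dict[str, List[str]],
    volume_group: Dict[str, str],
) -> Dict[str, List[str]]:
    """Return databases whose volumes belong to more than one RDF group."""
    findings: Dict[str, List[str]] = {}
    for database, volumes in db_volumes.items():
        groups = sorted({volume_group[volume] for volume in volumes})
        if len(groups) > 1:
            findings[database] = groups
    return findings


# Hypothetical layout: orders_db uses volumes from two RDF groups.
db_volumes = {
    "orders_db": ["vol_1", "vol_2"],
    "hr_db": ["vol_3"],
}
volume_group = {
    "vol_1": "rdf_group_1",
    "vol_2": "rdf_group_2",
    "vol_3": "rdf_group_1",
}

print(databases_spanning_groups(db_volumes, volume_group))
# {'orders_db': ['rdf_group_1', 'rdf_group_2']}
```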

What’s the impact? A rolling disaster scenario is characterized by the gradual failure of hardware and network, as opposed to abrupt and immediate cessation. Most real-life disasters are rolling (for example, fire, flood, virus attacks, computer crime, etc.). In a rolling disaster, network components will not fail at exactly the same time, resulting in one RDF group being out of sync with the other RDF group. This will irreversibly corrupt the database at the disaster recovery site. Data will need to be restored from a recent backup, increasing both the RTO and the RPO.

Why does a DR test miss this? When a company conducts an orderly shutdown of applications, databases and hosts, it leaves data in a consistent state. Gradual/rolling disasters that bring systems or network elements down one by one are extremely difficult to emulate in a DR test.

Note: Many companies actually experience this problem but incorrectly assume it is the result of some network abnormality. However, unless the issue is properly diagnosed and corrected, it will reoccur.

If this is of interest to you, you can check out some other typical gaps on our website: http://www.continuitysoftware.com/commongaps

