SLA Management™ Now Available in RecoverGuard 5.0

February 3, 2010

The planning and execution of disaster recovery procedures often involve multiple teams within an organization, including business continuity managers, as well as storage, system/application, and other groups within the IT department.  However, poor communication and collaboration, as well as conflicting objectives, can create a disconnect between these various teams.  As a result, they often over- or under-provision the disaster recovery systems designed to ensure data protection and availability – causing either unacceptable exposure or significant waste of resources when an event occurs.

Our new Service Level Agreement (SLA) Management module, available with RecoverGuard 5.0, can help organizations overcome these challenges, providing them with a robust solution that ensures sufficient protection at all times, while eliminating the wasted money or staff time associated with over-provisioning.  Those responsible for business continuity and data protection will gain greater control over and visibility into the storage resources they dedicate for disaster recovery purposes.

With a broad range of powerful capabilities, our SLA Management solution can enable companies to avoid the problems typical of disaster recovery procedures, such as lack of redundancy, too few copies of critical data, and over-inflated costs.  This empowers them to more effectively and economically optimize storage allocation and utilization to meet application performance goals and requested service levels.

Key features of our SLA Management module include:

  • An intuitive, business-oriented SLA definition builder that makes it easy for even those users with little or no technical savvy to define levels of service, and associate them with services, servers, databases, and other technology assets.  For example, users can define how often remote copies are refreshed, the type of storage to be used, how long local copies will be retained, or the level of redundancy (if any).
  • Comprehensive reporting that allows IT staff to assess existing service levels, and compare them to policies and guidelines.
  • Real-time alerts that immediately notify stakeholders when deviations from SLA rules take place.
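
As a rough illustration of how an SLA definition and its deviation checks might fit together, here is a minimal Python sketch. It is not RecoverGuard's actual interface: the `SlaPolicy` and `ObservedState` classes, their field names, and the message strings are all hypothetical, chosen only to mirror the examples in the list above (refresh frequency, storage type, retention, and redundancy).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SlaPolicy:
    """Hypothetical service-level definition for one asset."""
    max_refresh_interval_min: int  # how often remote copies must be refreshed
    storage_type: str              # type of storage to be used
    local_retention_days: int      # how long local copies must be retained
    min_redundant_copies: int      # required level of redundancy (0 = none)


@dataclass(frozen=True)
class ObservedState:
    """Hypothetical measurements collected for the same asset."""
    refresh_interval_min: int
    storage_type: str
    local_retention_days: int
    redundant_copies: int


def find_deviations(policy: SlaPolicy, observed: ObservedState) -> List[str]:
    """Return one message for every SLA rule the observed state violates."""
    deviations = []
    if observed.refresh_interval_min > policy.max_refresh_interval_min:
        deviations.append("remote copy refreshed too rarely")
    if observed.storage_type != policy.storage_type:
        deviations.append("wrong storage type")
    if observed.local_retention_days < policy.local_retention_days:
        deviations.append("local copies retained too briefly")
    if observed.redundant_copies < policy.min_redundant_copies:
        deviations.append("too few redundant copies")
    return deviations
```

An empty result means the asset meets its SLA; any non-empty result is what a report or alert would surface.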

 

Visit our Web site to learn more about our new SLA Management module, and how it can help your business achieve maximum data protection, in the most efficient and cost-effective manner possible.


Note about New Software Release

August 3, 2009

By Gil Hecht
CEO

I thought I’d use this space to give you a quick update on the latest version of our RecoverGuard automated disaster recovery/high availability testing and monitoring software. As you know, RecoverGuard automatically scans your IT infrastructure to find hidden configuration errors or data protection gaps before they impact your operations.  Here’s a brief summary of the new functionality in Version 4.3:

  • Support for the EMC CLARiiON platform, including MirrorView and SnapView. RecoverGuard also supports EMC Symmetrix, NetApp and HDS USP & AMS.  In an upcoming release we’ll expand this list further by adding support for HP XP and IBM DS.
  • Configuration testing support for all major cluster vendors, enabling you to ensure availability at all times.
  • Significant advancements to the Availability Advisor, enabling RecoverGuard to scan for risks in kernel parameters, storage routing, domain and DNS settings, patches and service packs, and much more.

Of course, we continue to add new DR/HA risks to RecoverGuard’s robust gap detection knowledgebase, which currently holds over 3,000 potential known risks. We’ve posted some of the more common gaps on this blog, and will continue to do so as time goes on. You can see all of those we’ve posted by clicking on “Gap Analysis” under “Categories” in the upper right portion of this blog. If you’d like to see even more examples, you can see them on our website at http://www.continuitysoftware.com/commongaps


Gap Analysis #6: Configuration Drift between Production and HA

May 16, 2009

by Yaniv Valik
SR DR Specialist, DR Assurance Group

Here’s a gap that we frequently see in HA environments.

Gap: Configuration Drifts between HA Cluster Nodes

Risk: Downtime; manual intervention needed to recover

How does it happen?
While there are many ways this can occur, let’s look at one example: the passive node has no redundancy at the HBA level or in its DNS configuration, while the currently active node is configured with redundancy for both. A single HBA or a single DNS server is a single point of failure, so upon fail-over/switch-over to the currently passive node, the applications running on this cluster will suffer from reduced availability/MTBF and more downtime. In addition, the passive node is configured with a significantly lower limit on open files, which may lead to application failures. Moreover, the passive node has only 1GB of swap while the active node was configured with an additional 4GB. Upon fail-over, the applications may not have sufficient memory to run properly. Lastly, differences in installed products may have various impacts, depending on the product type.
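
To make the idea concrete, here is a minimal Python sketch of a drift check between two cluster nodes. It assumes each node's settings have already been collected into a plain dictionary; the `find_drift` function, the setting names, and the sample values are all hypothetical and only loosely mirror the example above.

```python
from typing import Any, Dict, Tuple


def find_drift(active: Dict[str, Any], passive: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Map every setting that differs to its (active, passive) values.

    A setting missing on one node is reported with None on that side.
    """
    all_keys = active.keys() | passive.keys()
    return {
        key: (active.get(key), passive.get(key))
        for key in sorted(all_keys)
        if active.get(key) != passive.get(key)
    }


# Illustrative values only; real numbers would come from the hosts themselves.
active_node = {"hba_count": 2, "dns_servers": 2, "max_open_files": 65536, "swap_gb": 5}
passive_node = {"hba_count": 1, "dns_servers": 1, "max_open_files": 1024, "swap_gb": 1}

for setting, (on_active, on_passive) in find_drift(active_node, passive_node).items():
    print(f"{setting}: active={on_active} passive={on_passive}")
```

A plain equality check like this flags every difference. In practice some differences are intentional, so a real tool would need a way to whitelist them.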

What is the impact?
This will vary depending upon the specific drift, but can include a failure to fail over or switch over to the other node (causing downtime), or reduced performance after fail-over/switch-over which will, at best, create an operations slowdown and, at worst, leave the node unable to carry the load.

Can it happen to me?
This situation occurs frequently in HA environments. The configuration of a host involves so many details that it is very difficult to ensure an HA server is fully synchronized to its production host at all times.

If you’d like to read more gap analyses, go to our website at http://www.continuitysoftware.com/commongaps


Gap Analysis #5: Insufficient DR Configuration/Resources

March 22, 2009

by Yaniv Valik
SR DR Specialist, DR Assurance Group

This gap was recently uncovered at a large insurance company. Since it is one that comes up a lot during routine infrastructure HA/DR monitoring, and inevitably surprises the IT organization, I thought it would be a good one to focus on today.

Gap: Insufficient DR Configuration/Resources

Risk: Extended recovery times, Recovery Time Objective violation

How does it happen? DR and production infrastructures are usually not the same. When building a DR data center, organizations tend to assign fewer resources than their production environments have. If the DR configuration includes significantly fewer resources than production, there is a good chance it will be unable to assume production properly upon failover. It is not unusual, for example, to find a production environment that has multiple paths to storage or software, while the DR environment has too few, or even just a single path. It is also common to find DR sites with misconfigured kernel parameters or insufficient memory or CPU to support full production load.
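
A check for this kind of shortfall can be sketched in a few lines of Python. This is only an illustration under simple assumptions: resources are expressed as numbers per host, and the `find_shortfalls` function, the resource names, and the `min_ratio` parameter are hypothetical.

```python
from typing import Dict, Tuple


def find_shortfalls(
    production: Dict[str, float],
    dr: Dict[str, float],
    min_ratio: float = 1.0,
) -> Dict[str, Tuple[float, float]]:
    """Return resources where DR provides less than min_ratio of production.

    Each entry maps a resource name to its (production, dr) values.
    A resource missing from the DR inventory counts as zero.
    """
    shortfalls = {}
    for resource, prod_value in production.items():
        dr_value = dr.get(resource, 0)
        if dr_value < prod_value * min_ratio:
            shortfalls[resource] = (prod_value, dr_value)
    return shortfalls
```

With `min_ratio=1.0` the DR host must match production exactly; an organization that accepts degraded performance after failover could lower the ratio, at the risk described above.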

What is the impact? When the DR site cannot assume production as planned, business operations cannot resume in accordance with the company’s established SLA. In the best case scenario, IT must devote additional resources to execute the unplanned configuration of servers and storage. In the worst case scenario, the company will need to incur additional unplanned capital expenses.

Why does the DR test miss it? Most DR tests do not simulate full production load, so these errors remain undetected. Since DR is mostly offline, this issue never comes to life until an emergency occurs.

If this is of interest to you, you can check out some other typical gaps on our website: http://www.continuitysoftware.com/commongaps


Gap Analysis #4: Point-in-time copies never tested

March 19, 2009

by Yaniv Valik
SR DR Specialist, DR Assurance Group

So far in my Gap Analysis series we’ve covered replication inconsistencies, missing networking resources and tampering risk gaps. Today I’m going to take a look at what happens when point-in-time copies are never tested.

Gap: Point-in-time copies never tested

Risk: Data loss and increased time to recover

How does it happen? Point-in-time copies such as snapshots and BCVs are the second line of defense against human errors, viruses, and outages. The DR configuration for applications typically includes:

  • Multiple local point-in-time copies such as EMC TimeFinder, HDS ShadowImage/Snapshot, NetApp FlexClone/Snapshot, or CLARiiON SnapView;
  • Remote synchronous replication such as EMC SRDF, Hitachi TrueCopy, CLARiiON MirrorView, and NetApp SnapMirror; and
  • Local point-in-time copies on the remote site.

In addition, the copies could be mapped to the target DR servers, configured with multi-path software such as EMC PowerPath, Veritas DMP and MPIO, and defined in logical volumes such as Veritas VxVM.

Point-in-time copies can easily become corrupt without being discovered, unless the application is fully started and the data integrity is thoroughly tested. There are numerous scenarios that can lead to such corruption, such as when the replica devices do not all belong to the same consistency group.

What is the impact? The replica is corrupt and unusable. The file system will need to be recreated at the disaster recovery site and data restored from a recent backup, thereby increasing the time to recovery. All data created since the last backup will be lost. In many cases a corrupted file system may still appear usable, and only a close inspection of the content reveals that the data is meaningless.

Why does the DR test miss this? This gap can be missed if the specific business service is not tested for DR or if the DR test only includes turning on the DR server without actually running the applications.

If this is of interest to you, you can check out some other typical gaps on our website: http://www.continuitysoftware.com/commongaps


Gap Analysis #3: Tampering Risk

March 9, 2009

by Yaniv Valik
SR DR Specialist, DR Assurance Team

Continuing with my gap analysis series, in this post I’ll examine what causes a Tampering Risk to occur, and its impact on the business if the risk is not detected and resolved.  

Gap: Tampering Risk

Risk: DR failure and data corruption

How does it happen? This hidden risk is the result of an unauthorized host at the DR site erroneously configured with access to one or more storage devices. This is a very common error, and, much to the surprise of many organizations, there are dozens of reasons why it can happen. In each case, however, it can remain dormant during normal operations and is only revealed during an actual full-blown disaster. Here are just a few reasons why this error may occur:

  • When performing a storage migration, the storage administrator forgets to remove old device mappings to the host. After repurposing the old devices to new hosts, some are still visible by the original, now unauthorized host.
  • From time to time, extra mapping may be added to increase performance or resiliency of access to the disk. If zoning and masking are not controlled and managed from a central point, one of the paths might actually go “astray.”
  • Sometimes HBAs are replaced not because they are faulty but because greater bandwidth is required. If soft-zoning is used and is not updated accordingly, an old HBA still retains permission to access the original storage devices. Once the HBA is reused on a different host (which can occur months after the upgrade) this host will actually get access rights to the SAN devices which belong to the original host.
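
All three cases come down to the same question: which hosts can actually see a device, and which hosts are supposed to? Here is a minimal Python sketch of that comparison. It assumes the effective mappings and the intended ones are already available as dictionaries; the `find_unauthorized_access` function and the device and host names are hypothetical.

```python
from typing import Dict, Iterable, List


def find_unauthorized_access(
    visible_to: Dict[str, Iterable[str]],
    authorized: Dict[str, Iterable[str]],
) -> Dict[str, List[str]]:
    """Return {device: hosts} for hosts that see a device without authorization.

    visible_to maps each device to the hosts that can currently access it.
    authorized maps each device to the hosts that are meant to access it.
    """
    findings = {}
    for device, hosts in visible_to.items():
        extra = set(hosts) - set(authorized.get(device, ()))
        if extra:
            findings[device] = sorted(extra)
    return findings
```

A stale mapping left over from a migration or a reused HBA would show up here as an extra host on a device it no longer owns.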

What is the impact? During a disaster, a race condition will develop, with several unpleasant scenarios:

  • Scenario 1—The unauthorized host might gain exclusive access to the erroneously mapped disk. In this case, the designated standby will be unable to mount and use the locked devices, and it could take some time to isolate and fix the problem. There is also the risk of the unauthorized host actually using the erroneously mapped disk, thereby corrupting the data and rendering recovery impossible.
  • Scenario 2—Both the standby and the unauthorized hosts get concurrent access to the disk. If the unauthorized host attempts to use the erroneously mapped disk, not only will the data be corrupted instantly, but the now-active standby may unexpectedly crash.

Why does the DR test miss this? Simply put, because all hosts are rarely brought up at the same time. As already explained, many organizations choose to test only one subset of the environment at a time. During a test, the original and unauthorized servers would not both be started at the same time, but in a real event they would be, and they will wreak havoc on the data.

If this is of interest to you, you can check out some other typical gaps on our website: http://www.continuitysoftware.com/commongaps


Gap Analysis #2: Missing Network Resources

March 2, 2009

by Yaniv Valik
SR DR Specialist, DR Assurance Team

In my last post I took a closer look at the problems created by Replication Inconsistencies. Here’s a detailed look at another very common gap, why it happens, and how it can impact your operations.

Gap: Missing Network Resources

Risk: Extended recovery time

How does it happen? This risk can generally be traced to a configuration mistake which occurs when DR is not considered during the configuration process. The source host is accessing network file systems (CIFS/NFS). The network file systems are stored on a production server/array/NAS device. The target DR server also accesses the network file systems from the same production server on the production site. During a DR test, the production file server is not brought offline and the test succeeds. During a real disaster the production server will not be available.
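
One way to spot this dependency ahead of time is to look at where the DR server's network file systems are actually served from. The Python sketch below assumes you already have the list of mount sources on the DR host and a list of production file servers; the function names and host names are hypothetical, and only the common `host:/path` (NFS) and `//host/share` (CIFS) source forms are handled.

```python
from typing import Iterable, List


def mount_server(source: str) -> str:
    """Extract the server name from an NFS or CIFS mount source string."""
    if source.startswith("//"):
        # CIFS form: //host/share
        return source[2:].split("/", 1)[0]
    # NFS form: host:/path
    return source.split(":", 1)[0]


def mounts_on_production(dr_mounts: Iterable[str], production_servers: Iterable[str]) -> List[str]:
    """Return the DR host's mount sources that point at production servers."""
    production = set(production_servers)
    return [src for src in dr_mounts if mount_server(src) in production]
```

Any mount this returns will work during a test while production is up, and fail in a real disaster.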

What is the impact? If the network file systems were not replicated to a DR site, data loss will result. If the systems are replicated, recovery time will be extended while the administrator locates corresponding file systems on the DR site and mounts them on the DR standby server. This assumes the organization has excellent site documentation. Without it, however, data loss will occur.

Why does the DR test miss this? When running the DR test for a specific business service or application, most companies do not shut down the entire production datacenter. The DR test will be successful because the other assets are accessible and responding. Therefore, the DR site will use the production system unknowingly.

If this is of interest to you, you can check out some other typical gaps on our website: http://www.continuitysoftware.com/commongaps


Gap Analysis #1: Replication Inconsistencies

February 21, 2009

by Yaniv Valik
SR DR Specialist, DR Assurance Team

Doron’s recent post about the different types of risks that occur when configuration gaps are created got me thinking that you might be interested in more details about individual gaps. In this and subsequent posts, I’ll give you a quick description of a gap we discover in many companies, explain why it occurs, and what it means to the business.

I’ll start with Replication Inconsistencies (Different RDF Groups).

What’s the risk?  Data loss and increased time to recover.

How does it happen? This is a common gap found in large EMC SRDF/S and SRDF/A environments where multiple RDF groups are needed. It occurs most often when storage volumes from different RDF groups are provisioned to the host and used by the same database. The provisioning tools do not alert or prevent this configuration. Each RDF group is associated with different replication adapters and potentially different network infrastructures. Rolling disaster scenarios can result in corrupted replicas at the disaster recovery site.
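
Since the provisioning tools do not flag this configuration, the check has to be made separately. Here is a minimal Python sketch of it, assuming you can already list which volumes each database uses and which RDF group each volume belongs to. The `databases_spanning_groups` function and all database, volume, and group names are hypothetical.

```python
from typing import Dict, Iterable, List


def databases_spanning_groups(
    db_volumes: Dict[str, Iterable[str]],
    volume_group: Dict[str, str],
) -> Dict[str, List[str]]:
    """Return {database: RDF groups} for databases on more than one RDF group.

    db_volumes maps each database to the storage volumes it uses.
    volume_group maps each volume to the RDF group that replicates it.
    """
    findings = {}
    for database, volumes in db_volumes.items():
        groups = {volume_group[volume] for volume in volumes}
        if len(groups) > 1:
            findings[database] = sorted(groups)
    return findings
```

A database that appears in the result is replicated through more than one RDF group, which is the configuration described above.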

What’s the impact? A rolling disaster scenario is characterized by the gradual failure of hardware and network, as opposed to abrupt and immediate cessation. Most real-life disasters are rolling (for example, fire, flood, virus attacks, and computer crime). In a rolling disaster, network components will not fail at exactly the same time, resulting in one RDF group being out of sync with the other RDF group. This will irreversibly corrupt the database at the disaster recovery site. Data will need to be restored from a recent backup, increasing both the RTO and the RPO.

Why does a DR test miss this? When a company conducts an orderly shutdown of applications, databases and hosts, it leaves data in a consistent state. Gradual/rolling disasters that bring systems or network elements down one by one are extremely difficult to emulate in a DR test.

Note: Many companies actually experience this problem but incorrectly assume it is the result of some network abnormality. However, unless the issue is properly diagnosed and corrected, it will reoccur.

If this is of interest to you, you can check out some other typical gaps on our website: http://www.continuitysoftware.com/commongaps

