Gap Analysis #5: Insufficient DR Configuration/Resources

March 22, 2009

by Yaniv Valik
SR DR Specialist, DR Assurance Group

This gap was recently uncovered at a large insurance company. Since it is one that comes up a lot during routine infrastructure HA/DR monitoring, and inevitably surprises the IT organization, I thought it would be a good one to focus on today.

Gap: Insufficient DR Configuration/Resources

Risk: Extended recovery times, Recovery Time Objective violation

How does it happen? DR and production infrastructures are usually not the same. When building a DR data center, organizations tend to assign fewer resources than their production environments have. If the DR configuration includes significantly fewer resources than production, there is a good chance it will be unable to assume production properly upon failover. It is not unusual, for example, to find a production environment that has multiple paths to storage or software, while the DR environment has too few, or even just a single path. It is also common to find DR sites with misconfigured kernel parameters or insufficient memory or CPU to support full production load.
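
To make the comparison concrete, here is a minimal sketch in Python of how such a shortfall could be flagged. It assumes you already have an inventory record for each production host and its DR counterpart. The record fields (`cpu_cores`, `memory_gb`, `storage_paths`), the host names, the sample numbers and the rule that DR should have at least as much as production are all assumptions made for illustration, not a description of any particular tool.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class HostResources:
    """Hypothetical inventory record for one host."""
    name: str
    cpu_cores: int
    memory_gb: int
    storage_paths: int


def find_shortfalls(prod: HostResources, dr: HostResources,
                    min_ratio: float = 1.0) -> list[str]:
    """Return one message per resource where DR is below min_ratio * production."""
    shortfalls = []
    for field in ("cpu_cores", "memory_gb", "storage_paths"):
        prod_value = getattr(prod, field)
        dr_value = getattr(dr, field)
        if dr_value < prod_value * min_ratio:
            shortfalls.append(
                f"{dr.name}: {field} is {dr_value}, "
                f"production {prod.name} has {prod_value}"
            )
    return shortfalls


# Illustrative numbers only: same CPU, half the memory, a single storage path.
prod = HostResources("prod-db01", cpu_cores=16, memory_gb=64, storage_paths=4)
dr = HostResources("dr-db01", cpu_cores=16, memory_gb=32, storage_paths=1)

for message in find_shortfalls(prod, dr):
    print(message)
```

With these sample records the check reports the memory and storage path shortfalls and stays silent about CPU. A `min_ratio` below 1.0 could be used where the organization has deliberately accepted a smaller DR footprint.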

What is the impact? When the DR site cannot assume production as planned, business operations cannot resume in accordance with the company’s established SLA. In the best case scenario, IT must devote additional resources to execute the unplanned configuration of servers and storage. In the worst case scenario, the company will need to incur additional unplanned capital expenses.

Why does the DR test miss it? Most DR tests do not simulate full production load, so these errors remain undetected. Since DR is mostly offline, this issue never comes to light until an emergency occurs.

If this is of interest to you, you can check out some other typical gaps on our website: http://www.continuitysoftware.com/commongaps


Gap Analysis #4: Point-in-time copies never tested

March 19, 2009

by Yaniv Valik
SR DR Specialist, DR Assurance Group

So far in my Gap Analysis series we’ve covered replication inconsistencies, missing networking resources and tampering risk gaps. Today I’m going to take a look at what happens when point-in-time copies are never tested.

Gap: Point-in-time copies never tested

Risk: Data loss and increased time to recover

How does it happen? Point-in-time copies like snapshots and BCVs are the second line of defense against human errors, viruses and outages. The DR configuration for applications typically includes:

  • Multiple local point-in-time copies such as EMC TimeFinder, HDS ShadowImage/Snapshot, NetApp FlexClone/Snapshot, or CLARiiON SnapView;
  • Remote synchronous replication such as EMC SRDF, Hitachi TrueCopy, CLARiiON MirrorView, and NetApp SnapMirror; and
  • Local point-in-time copies on the remote site.

In addition, the copies could be mapped to the target DR servers, configured with multi-path software such as EMC PowerPath, Veritas DMP and MPIO, and defined in logical volumes such as Veritas VxVM.

Point-in-time copies can easily become corrupt without being discovered, unless the application is fully started and the data integrity is thoroughly tested. There are numerous scenarios that can lead to such corruption, such as when the replica devices do not all belong to the same consistency group.
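
The consistency group scenario is one that can be checked from configuration data alone. The Python below is a minimal sketch of that check, assuming a hypothetical export that lists each replica device with the application it serves and the consistency group it belongs to. The tuple layout, the device names and the group names are invented for illustration. Passing this check does not prove a copy is usable; only starting the application and testing the data can do that.

```python
from collections import defaultdict


def groups_per_application(replica_devices):
    """Map each application to the set of consistency groups its replicas use.

    replica_devices is an iterable of (application, device_id, consistency_group)
    tuples. A consistency_group of None means the device is in no group.
    """
    groups = defaultdict(set)
    for application, _device_id, consistency_group in replica_devices:
        groups[application].add(consistency_group)
    return groups


def find_inconsistent_applications(replica_devices):
    """Return applications whose replica devices are not all in one group."""
    return sorted(
        application
        for application, groups in groups_per_application(replica_devices).items()
        if len(groups) != 1 or None in groups
    )


# Hypothetical export: one "claims" device was left out of its group.
devices = [
    ("billing", "dev-001", "cg-billing"),
    ("billing", "dev-002", "cg-billing"),
    ("claims", "dev-101", "cg-claims"),
    ("claims", "dev-102", None),
]

print(find_inconsistent_applications(devices))  # prints ['claims']
```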

What is the impact? The replica is corrupt and unusable. The file system will need to be recreated at the disaster recovery site and data restored from a recent backup, thereby increasing the time to recovery. All data created since the last backup will be lost. In many cases a corrupted file system may still appear usable, and only a close inspection of the content reveals that the data is meaningless.

Why does the DR test miss this? This gap can be missed if the specific business service is not tested for DR or if the DR test only includes turning on the DR server without actually running the applications.

If this is of interest to you, you can check out some other typical gaps on our website: http://www.continuitysoftware.com/commongaps


New release provides end-to-end solution

March 16, 2009

by Doron Pinhas
VP, Field Operations

All contributors to this blog try very hard not to hype our company’s products and services. Our posts are written to offer insights on a topic and technology we all know very well – DR and HA. However, today I’m going to break from that position just briefly. Continuity Software has announced RecoverGuard 4.0, a new version of its automated HA/DR testing and monitoring solution. I’d like to share it with you because it really does represent a major technological advancement.

With new support for clusters, root cause analysis, high availability gap detection and reporting, RecoverGuard is now a complete, end-to-end solution (from protection of data through availability) for ensuring business continuity.  This is significant because for the first time it will give you visibility into your HA infrastructure – which has never been possible before.

If you’d like to learn more about this important new release, you can:

  • Read the press release
  • Watch the on-demand webinar Solving HA/DR Configuration Drift
  • Check out what the press and analysts had to say


Interesting eWeek article: Can Business Continuity Handle Unfettered Data Growth?

March 11, 2009

by Gil Hecht
CEO

I came across an interesting article yesterday in eWeek. Analysts believe that once we get past the current recession an unprecedented period of data growth is going to take place. They predicted thousands of exabytes (1 exabyte = 1,000 petabytes) will be created by individuals and companies beginning in 2011. The article explores whether today’s business continuity products and services, and IT staff, will be able to cope with this growth. Food for thought.

http://www.eweek.com/c/a/Data-Storage/Can-Business-Continuity-Handle-Unfettered-Data-Growth/


Gap Analysis #3: Tampering Risk

March 9, 2009

by Yaniv Valik
SR DR Specialist, DR Assurance Team

Continuing with my gap analysis series, in this post I’ll examine what causes a Tampering Risk to occur, and its impact on the business if the risk is not detected and resolved.  

Gap: Tampering Risk

Risk: DR failure and data corruption

How does it happen? This hidden risk is the result of an unauthorized host at the DR site erroneously configured with access to one or more storage devices. This is a very common error, and, much to the surprise of many organizations, there are dozens of reasons why it can happen. In each case, however, it can remain dormant during normal operations and is only revealed during an actual full-blown disaster. Here are just a few reasons why this error may occur:

  • When performing a storage migration, the storage administrator forgets to remove old device mappings to the host. After repurposing the old devices to new hosts, some are still visible by the original, now unauthorized host.
  • From time to time, extra mapping may be added to increase performance or resiliency of access to the disk. If zoning and masking are not controlled and managed from a central point, one of the paths might actually go “astray.”
  • Sometimes HBAs are replaced not because they are faulty but because greater bandwidth is required. If soft-zoning is used and is not updated accordingly, an old HBA still retains permission to access the original storage devices. Once the HBA is reused on a different host (which can occur months after the upgrade) this host will actually get access rights to the SAN devices which belong to the original host.
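
Whatever the cause, the symptom is the same: a host can see a device it has no business seeing. As a rough sketch, assuming you can export both the actual device-to-host visibility and the intended assignments, a check for it could look like the Python below. The two dictionaries, the LUN names and the host names are hypothetical; in practice the data would have to come from your zoning and masking configuration.

```python
def find_unauthorized_mappings(device_mappings, authorized_hosts):
    """Return (device, host) pairs where a host sees a device it should not.

    device_mappings:  dict of device -> set of hosts that can currently see it
    authorized_hosts: dict of device -> set of hosts that are meant to see it
    """
    findings = []
    for device, hosts in sorted(device_mappings.items()):
        allowed = authorized_hosts.get(device, set())
        for host in sorted(hosts - allowed):
            findings.append((device, host))
    return findings


# Hypothetical data: lun-17 is still visible to a repurposed host.
mappings = {
    "lun-17": {"dr-db01", "dr-app07"},
    "lun-18": {"dr-db01"},
}
authorized = {
    "lun-17": {"dr-db01"},
    "lun-18": {"dr-db01"},
}

print(find_unauthorized_mappings(mappings, authorized))
# prints [('lun-17', 'dr-app07')]
```

A device with no entry in `authorized_hosts` is treated as having no authorized hosts at all, so every host that sees it is reported. That is a deliberately strict choice for this sketch.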

What is the impact? During a disaster, a race condition will develop, with several unpleasant scenarios:

  • Scenario 1—The unauthorized host might gain exclusive access to the erroneously mapped disk. In this case, the designated standby will be unable to mount and use the locked devices, and it could take some time to isolate and fix the problem. There is also the risk of the unauthorized host actually using the erroneously mapped disk, thereby corrupting the data and rendering recovery impossible.
  • Scenario 2—Both the standby and the unauthorized hosts get concurrent access to the disk. If the unauthorized host attempts to use the erroneously mapped disk, not only will the data be corrupted instantly, but the now-active standby may unexpectedly crash.

Why does the DR test miss this? Simply put, because all hosts are rarely brought up at the same time. As already explained, many organizations choose to test only one subset of the environment at a time. During a test, the original and unauthorized servers would not both be started at the same time, but in a real event they would be, and they would wreak havoc on the data.

If this is of interest to you, you can check out some other typical gaps on our website: http://www.continuitysoftware.com/commongaps


What it takes to develop a bullet-proof HA/DR plan

March 5, 2009

by Gil Hecht
CEO

Today, most IT organizations have virtually no ability to accurately measure how well their high availability (HA) and disaster recovery (DR) plans will respond when needed.  As other posters on this blog have noted, this is primarily due to the complexity and scale of their IT infrastructures as well as the limitations of traditional HA and DR methodologies.

But as we look to the future, IT will continue to be under pressure to ensure business continuity and eliminate the possibility of even the most minute data loss.  Anything less than completely bullet-proof plans will no longer be acceptable.

I believe that automated HA and DR testing and monitoring software will help companies compensate for the shortcomings of current methodologies and deliver the bullet-proof protection companies demand. These solutions routinely check for hidden vulnerabilities and identify problems — in both virtualized and non-virtualized environments — that had previously gone undetected, allowing the issues to be addressed before they impact business operations.

The need for this technology is clearly demonstrated, which means it will not be long before companies have a number of solutions from which to choose. So what should an IT manager look for in an ideal solution? Here are some things to consider:

  • Automatic IT Discovery and Scanning - agentless technology used to scan an IT environment and collect information and configuration data from IT assets.
  • Robust Gap Detection – ability to identify the thousands of possible HA and DR gaps and vulnerabilities – including, but not limited to: interdependencies/mapping between virtual and physical layers, data tampering, data completeness, data consistency, as well as host configuration in heterogeneous environments.
  • Visualization and Reports – provide a clear view of the IT infrastructure configuration and status, as well as the discovered HA and DR gaps. Visualization provides all the necessary information to drill-down into a specific IT asset, understand its protection status, and to resolve a potential gap that was discovered.
  • Customizable Reporting Infrastructure - ticket summaries delivered on a preset schedule, via a predetermined delivery protocol.  Effective notification means fast and accurate issue resolution.  The solution should also integrate with existing configuration management databases (CMDBs) and ticket management systems, in order to maintain one point of contact for monitoring and managing the entire environment. 
  • Optimization Discovery - ability to analyze the information previously gathered to identify IT optimization and fine-tuning opportunities in the infrastructure, in order to maximize utilization and extend ROI.

A few DR testing mistakes to avoid

March 2, 2009

by Doron Pinhas
VP, Field Operations

1. Don’t leave your production ECC server available during a DR test. Storage management tools such as ECC are the main tools used by system and storage administrators to understand and configure the relationship between servers and storage devices. It is common practice not to map all replica devices to the DR servers during normal operations. However, if a valid and current DR ECC environment is not maintained, there may be no easy way to tell how to map thousands of unmapped devices to the appropriate DR servers. In the stress and confusion that accompanies disaster events, this may lead to significantly extended recovery time.

2. When running a DR test, many companies will just confirm the application started or, at best, run one or two transactions before returning to production. The danger with this shortcut is that real system usage is not simulated, so it is impossible to determine if there are underlying problems – database dependencies, for example, or the ability to support the true production load – that could have an impact in a real failover event.


Gap Analysis #2: Missing Network Resources

March 2, 2009

by Yaniv Valik
SR DR Specialist, DR Assurance Team

In my last post I took a closer look at the problems created by Replication Inconsistencies. Here’s a detailed look at another very common gap, why it happens, and how it can impact your operations.

Gap: Missing Network Resources

Risk: Extended recovery time

How does it happen? This risk can generally be traced to a configuration mistake which occurs when DR is not considered during the configuration process. The source host is accessing network file systems (CIFS/NFS). The network file systems are stored on a production server/array/NAS device. The target DR server also accesses the network file systems from the same production server on the production site. During a DR test, the production file server is not brought offline and the test succeeds. During a real disaster the production server will not be available.
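
A simple way to picture the check is to list the network file systems mounted on the DR server and see which of them are served from the production site. The Python below is a minimal sketch under that assumption. The mount list, the server names and the set of production file servers are hypothetical, and the parsing only handles the two common source forms, `server:/export` for NFS and `//server/share` for CIFS.

```python
def file_server_of(source):
    """Extract the server name from an NFS or CIFS mount source.

    NFS sources look like 'server:/export', CIFS sources like '//server/share'.
    """
    if source.startswith("//"):
        return source[2:].split("/", 1)[0]
    return source.split(":", 1)[0]


def mounts_on_production(dr_mounts, production_servers):
    """Return the DR server's mounts that are served from a production server."""
    return [
        (mount_point, source)
        for mount_point, source in dr_mounts
        if file_server_of(source) in production_servers
    ]


# Hypothetical mount table of the DR standby server.
dr_mounts = [
    ("/data/reports", "prod-nas01:/export/reports"),
    ("/data/shared", "//prod-fs02/shared"),
    ("/data/archive", "dr-nas01:/export/archive"),
]
production_servers = {"prod-nas01", "prod-fs02"}

for mount_point, source in mounts_on_production(dr_mounts, production_servers):
    print(f"{mount_point} depends on production: {source}")
```

With this sample data the first two mounts are reported and the archive mount, which is served from the DR site, is not.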

What is the impact? If the network file systems were not replicated to a DR site, data loss will result. If the systems are replicated, recovery time will be extended while the administrator locates corresponding file systems on the DR site and mounts them on the DR standby server. This assumes the organization has excellent site documentation. Without it, however, data loss will occur.

Why does the DR test miss this? When running the DR test for a specific business service or application, most companies do not shut down the entire production datacenter. The DR test will be successful because the other assets are accessible and responding. Therefore, the DR site will use the production system unknowingly.

If this is of interest to you, you can check out some other typical gaps on our website: http://www.continuitysoftware.com/commongaps

