Does Keeping Your Resume Up to Date Count as a Valid HA and DR Strategy?

August 25, 2010

By Gil Hecht, Founder and CEO, Continuity Software

Again and again, we are reminded of just how business-critical high availability (HA) and disaster recovery (DR) capabilities are in today’s highly competitive, ultra-demanding economy. Yet an unreasonably high percentage of today’s most well-known and respected business organizations are leaving themselves vulnerable to both natural and man-made IT disasters.

One need only review recent headlines to see what I mean: for instance, the 7-hour outage at Singapore’s largest banking network (“Global CIO: IBM’s Bank Outage: Anatomy of a Disaster”) and the 8-day outage at American Eagle Outfitters (“Oracle Backup Failure Major Factor in American Eagle 8-Day Crash”).

Clearly, HA and DR remain a persistent challenge for many data centers, regardless of industry or size. And while most data centers have implemented an HA and/or DR strategy, most also understand there is no guarantee it will actually deliver when called upon. Due to the time and expense involved, the strategy typically gets tested only once or twice a year. Then, over the following days, weeks and months, changes made to the production environment are not replicated appropriately, and the HA/DR strategy is rendered virtually useless.

On a daily basis, I meet with extremely experienced and talented data center managers to talk about how to ensure their organization’s HA and DR. Many privately admit that, while HA and DR are a high business priority from both an internal governance and an external regulatory standpoint, if they were ever to experience a true disaster, data and application availability would probably be lost for far longer than their SLAs allow (if not permanently). In fact, one IT executive joked that his DR strategy was to, “Keep my resume up to date.”

OK, just for the sake of argument… how about an affordable, easy-to-manage solution that mitigates data protection and high availability risks by detecting gaps and vulnerabilities between your primary production, HA cluster and/or remote DR sites?
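
For illustration only (this is not a description of any particular product’s internals), the core idea behind such gap detection can be sketched in a few lines of Python: compare the configuration recorded at the production site against the DR site and flag anything that has drifted. The configuration items and values below are hypothetical; in practice they would be collected automatically from both environments.

    # Toy illustration of configuration-drift detection between a production
    # site and its DR copy. The settings and values are hypothetical.

    production = {
        "db_version": "11.2.0.4",
        "lun_count": 48,
        "multipathing": "enabled",
        "kernel": "2.6.32-431",
    }

    dr_site = {
        "db_version": "11.2.0.3",   # patched in production, never on DR
        "lun_count": 46,            # two new LUNs were never replicated
        "multipathing": "enabled",
        "kernel": "2.6.32-431",
    }

    # Collect every setting whose DR value differs from (or is missing at) production.
    gaps = {k: (production[k], dr_site.get(k))
            for k in production if production[k] != dr_site.get(k)}

    for setting, (prod_value, dr_value) in gaps.items():
        print(f"GAP: {setting}: production={prod_value}, DR={dr_value}")

Each reported gap is a change that made it into production but never reached the DR site, which is exactly the kind of silent drift that turns a tested DR plan into a paper one.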


A cautionary note on replicating temporary databases and page files

February 7, 2009

By Yaniv Valik
Sr. DR Specialist, DR Assurance Team

We recently encountered an interesting infrastructure problem at a customer site. Since it may be relevant to others as well, I thought it would be best to share it here.

The customer uses EMC SRDF/A for disaster recovery and runs thousands of production servers, thousands of databases and applications, and hundreds of replicated servers. On the 10th of each month, the company generates a large number of monthly customer-analysis reports, which drives database I/O on tier-1 storage to very high levels. As a result, the corresponding RDF group could not keep up with the pace and went offline (suspended). Since the customer’s RPO policy is 30 minutes, and since it usually took 24 hours to resolve the issue (waiting for the databases to return to normal I/O rates and for manual intervention to bring the RDF group back online), this was unacceptable.

We looked at the tickets that RecoverGuard had generated during its regular scans of the environment and found that all of the temporary databases in the MS-SQL and Oracle environments were replicated, along with several swap devices. We suggested removing replication for these temporary databases and swap devices. Although it was not easy (some of the temporary databases had to be relocated because they shared storage with other databases), it paid off: the change reduced the load on the RDF group enough that it no longer fails.
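
As a rough illustration of the kind of check involved (a hand-rolled Python sketch, not how RecoverGuard implements it), you can cross-reference the paths of temporary databases and page files against the mount points backed by replicated devices. All of the sample data below is hypothetical; in practice you would pull it from your storage-array and database inventories.

    # Minimal sketch: flag temporary-database and page-file paths that sit on
    # replicated devices. Sample data only; the prefix match is deliberately naive.

    # Mount points (or drive letters) backed by replicated devices,
    # e.g. members of an SRDF RDF group.
    replicated_mounts = {"/oradata", "/mssql/data", "E:\\"}

    # Files that generally should not be replicated: tempdb files,
    # Oracle TEMP tablespaces, OS page/swap files.
    temp_and_swap_files = [
        "/oradata/PROD/temp01.dbf",   # Oracle TEMP tablespace datafile
        "/mssql/data/tempdb.mdf",     # MS-SQL tempdb data file
        "E:\\pagefile.sys",           # Windows page file
        "/local_disk/swapfile",       # swap on non-replicated local disk
    ]

    def replicated_mount_for(path, mounts):
        """Return the replicated mount point containing this path, if any."""
        return next((m for m in mounts if path.startswith(m)), None)

    for path in temp_and_swap_files:
        mount = replicated_mount_for(path, replicated_mounts)
        if mount:
            print(f"WARNING: {path} is on replicated storage ({mount}); "
                  "consider relocating it or removing it from the RDF group")
        else:
            print(f"OK: {path} is not replicated")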

Conclusion: Avoid replicating temporary databases and page files. You can still place them on high-end storage for best performance, but do not create replicas of the corresponding devices.


Defining Data Protection Terms

February 4, 2009

By Doron Pinhas
Vice President, Field Operations

There seems to be some confusion around the terminology used today for data protection. I hear many people invoke Data Protection Management (“DPM”), backup reporting and disaster recovery when trying to differentiate their offerings, which makes it hard to understand the value each vendor brings and the problem it solves.

Here is OUR short definition of the Data Protection segments:

  • Backup – taking a “point-in-time” copy of the data/application. Approaches range from traditional backup (to tape or disk) to online snapshots of the data (such as BCV or Clone in the EMC world). Verification tools in this space help you ensure your backup is complete and optimize the backup process.
  • High Availability – increasing business service availability. Technologies include virtualization, clusters, RAID, load balancing (to some degree) and other hardware tweaks. Verification tools are your typical system management technologies that help gauge the overall health of your system.
  • Disaster Recovery – resuming business operation from a remote data center. The approach is based on either a shared facility or a dedicated “mirrored” facility. Technologies include tape-based recovery (shipping your tapes to the remote site) and online replication (storage-, database- or server-based), which copies all data to the mirrored site so you can resume your business service from that site with minimal interruption. (A short worked example follows this list.)
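
To make these distinctions a bit more concrete, here is a small illustrative Python sketch (my own simplification, not a vendor formula): it compares the worst-case data loss you would accept with periodic point-in-time backups against asynchronous online replication. The intervals and lag values are hypothetical.

    # Illustrative sketch: worst-case data loss (RPO) under two data
    # protection approaches. The numbers are made-up examples, not measurements.

    def worst_case_loss_backup(backup_interval_hours):
        """Point-in-time backups: a failure just before the next backup loses
        everything written since the last one."""
        return backup_interval_hours

    def worst_case_loss_replication(replication_lag_minutes):
        """Asynchronous online replication: loss is bounded by the replication
        lag (e.g., how far an async replication cycle falls behind)."""
        return replication_lag_minutes / 60.0

    if __name__ == "__main__":
        print("Nightly backup, worst case:",
              worst_case_loss_backup(24), "hours of data lost")
        print("Async replication with 15-minute lag, worst case:",
              worst_case_loss_replication(15), "hours of data lost")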

I would love to hear your opinions and your definition of DPM.

