Another hidden downtime risk that can come back to bite you

January 17, 2012

Today’s Topic: Cluster Shared SAN Configuration Drift

The most common way to share data between cluster nodes is through the use of multi-homed SAN storage. Inconsistent access to the SAN volumes by cluster nodes is a state in which one or more shared volumes are not mapped to one or more nodes.

 

Sharing is intended to guarantee immediate data availability in case of a failover, but inconsistent mapping might put failover in jeopardy.

 

Why Does It Happen?

The initial configuration of a cluster is typically correct. However, routine configuration changes, such as adding a new storage volume or extending the cluster to additional nodes, can gradually cause configuration drift that leaves one or more shared volumes unmapped on some of the nodes.

What Is the Impact?

In the event of a cluster failover to the passive node, data stored on an unmapped volume will not be available, leading to downtime for any application that requires access to a database or files stored on that volume.

 

How Can It Be Avoided?

There are multiple ways to minimize the risk of such configuration drift:

1.     Documentation: Put in place clear and well-documented procedures for any changes introduced to the cluster configuration.

2.     Training: Conduct periodic training for all involved personnel to review possible availability risks introduced by production environment modifications.

3.     Automation: Implement automated auditing of your high availability environment to ensure passive node configuration is always consistent with active node configuration.
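As a rough illustration of the automated auditing idea, the sketch below compares the set of SAN volumes mapped to each cluster node and reports any node that is missing a volume the others can see. The node names, volume names and the `find_unmapped_volumes` helper are hypothetical; in practice the inventory would have to come from your storage or cluster tooling.

```python
from typing import Dict, Set


def find_unmapped_volumes(mappings: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Return, per node, the shared volumes mapped elsewhere but not to that node."""
    # Every volume seen by at least one node is expected on all nodes.
    all_volumes = set().union(*mappings.values())
    return {
        node: all_volumes - volumes
        for node, volumes in mappings.items()
        if all_volumes - volumes
    }


# Hypothetical inventory: node name -> SAN volumes currently mapped to it.
cluster = {
    "node-a": {"vol01", "vol02", "vol03"},
    "node-b": {"vol01", "vol02"},
}

drift = find_unmapped_volumes(cluster)
for node, missing in sorted(drift.items()):
    print(f"{node} is missing: {', '.join(sorted(missing))}")
```

Running a comparison like this on a schedule, rather than only after planned changes, is what catches drift introduced by routine work.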

Learn more about Automated Daily High Availability Testing


Hidden recovery risk that will come back to bite you

January 9, 2012

Today’s Hidden Risk: Invalid Database Snapshot

An invalid database copy can be created when a storage-based snapshot (or any other type of copy, such as replica, clone, etc.) of the database is taken while the database is not in “hot backup” mode. In this case, the copy may be unusable when a restore is attempted.
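One simple way to spot this risk, sketched below, is to compare each snapshot's timestamp against the periods when the database was in hot backup mode. The window and snapshot data, and the `taken_in_hot_backup` helper, are hypothetical; where those timestamps come from depends on your database and storage tooling.

```python
from datetime import datetime
from typing import List, Tuple

# A hot backup window: (time backup mode began, time backup mode ended).
Window = Tuple[datetime, datetime]


def taken_in_hot_backup(snapshot_time: datetime, windows: List[Window]) -> bool:
    """True if the snapshot falls inside at least one hot backup window."""
    return any(begin <= snapshot_time <= end for begin, end in windows)


# Hypothetical data for illustration only.
windows = [(datetime(2012, 1, 9, 1, 0), datetime(2012, 1, 9, 1, 15))]
snapshots = {
    "snap-nightly": datetime(2012, 1, 9, 1, 5),
    "snap-adhoc": datetime(2012, 1, 9, 14, 30),
}

suspect = sorted(
    name for name, taken in snapshots.items()
    if not taken_in_hot_backup(taken, windows)
)
for name in suspect:
    print(f"{name} was taken outside hot backup mode and may be unusable")
```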

 

Click here to learn:

  • How can an invalid database snapshot occur in your environment?
  • Why will it come back to bite you?
  • What will be the impact on your environment?
  • How can you avoid the risk?

This risk is one of 5,000+ downtime and data loss vulnerabilities in Continuity Software’s community-sourced knowledge base, identified in over one hundred of the largest data centers worldwide.
Read more now for free tips that will help you avoid this risk


The Vulnerability Index Benchmark

December 26, 2011

Based on data gathered from 88 organizations worldwide, the Vulnerability Index Benchmark by Continuity Software provides a first-of-its-kind measurement of downtime and data loss risk for each organization, grouped by industry sector.

Get your complimentary copy of the 2011 Vulnerability Index Benchmark study and find out:

  • Which areas of the IT infrastructure present the greatest HA/DR risks?
  • Which industry sectors exhibit the highest levels of downtime and data loss risks?
  • What types of HA/DR risks are most common in each sector?
  • How can you compare your risk to your industry peers?

Link to download the 2011 Vulnerability Index Benchmark study


Recent outage events catch BCP with their pants down

August 8, 2011

There has been a flurry of recent reports regarding outages leading to downtime and data loss. Here is just a small sample:

Clearly all the above invest heavily in HA/DR solutions, so why were they caught unprepared when an outage occurred?

We’ve addressed this question in previous posts such as “Why DR testing doesn’t work”, “Think that little config change is minor? Think again” and “BCP is not different than other IT departments”. The short answer is that today’s datacenters are too complex and dynamic to rely on periodic tests or audits.

Conclusion: real-time verification of HA/DR readiness is required.

How can this be accomplished? Considering the amount of labor put into a single audit/test, automation is a must. Disaster Recovery Management solutions such as RecoverGuard and DR Advisor can help you achieve the following benefits:

  • Continuous non-intrusive end-to-end infrastructure scan for 24×7 visibility into DR risks
  • HA/DR risk detection and readiness of passive systems for failover
  • Ability to measure the de-facto RPO and other HA/DR SLA metrics
  • Comprehensive HA/DR readiness reporting

To learn how the best datacenters in the world accomplish this feat, check out our latest webinar.


4 Oracle DataGuard Recovery Risks

July 19, 2011

Your company decided to replicate production Oracle databases to a remote DR site using Data Guard. The database administration crew set it up and ran several tests to ensure that it works properly. How do you know that it will also work properly tomorrow or the day after? Many things can go wrong, rendering the standby Oracle database unfit for recovery. Some risks are not specifically related to Oracle configuration. Others may be very subtle and difficult to identify manually. The standby database may fail to start when you need it the most. Or maybe it will start, but some of the data will be missing (ouch!). Or maybe it’ll “just” perform very badly after the failover and cause service disruption.

I’ve selected 4 examples of common Oracle Data Guard vulnerabilities to share with you. Obviously, there are thousands of risks that may affect the availability and recoverability of an Oracle database in a Data Guard configuration. So, here they are:

1.      Standby database not synchronized with its primary Oracle database 

Like the vast majority of companies that choose Data Guard, your company probably set it up in the default “Maximum Performance” mode, which puts the performance of the source database first and standby synchronization second. Redo data is shipped asynchronously to the standby database, and if there are delays, the standby database falls behind. It is reasonable to assume that the gap between the source and standby databases is widest during peak hours. If a failure occurs during this time, a significant amount of data could be lost. BCP Manager, how would you know whether Data Guard synchronization complies with your RPO goal? You don’t!

Of course, there are many other reasons for the standby Oracle database to fall behind the source database, such as network issues causing heartbeat failures, storage configuration on the standby servers and more.
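To make the RPO question concrete, here is a minimal sketch that checks a series of standby lag measurements against an RPO target. It assumes you already collect the lag somehow; the sample values, the 5-minute RPO and the helper names are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# One measurement: (when it was taken, how far the standby was behind).
LagSample = Tuple[datetime, timedelta]


def rpo_violations(samples: List[LagSample], rpo: timedelta) -> List[LagSample]:
    """Return the samples where the standby lag exceeded the RPO target."""
    return [(when, lag) for when, lag in samples if lag > rpo]


def worst_lag(samples: List[LagSample]) -> timedelta:
    """Return the largest observed lag, or zero if there are no samples."""
    return max((lag for _, lag in samples), default=timedelta(0))


# Hypothetical measurements: the gap widens during the mid-morning peak.
rpo = timedelta(minutes=5)
samples = [
    (datetime(2011, 7, 19, 3, 0), timedelta(seconds=20)),
    (datetime(2011, 7, 19, 10, 0), timedelta(minutes=12)),
    (datetime(2011, 7, 19, 22, 0), timedelta(minutes=1)),
]

violations = rpo_violations(samples, rpo)
for when, lag in violations:
    print(f"{when:%Y-%m-%d %H:%M}: standby lag {lag} exceeds RPO of {rpo}")
print(f"Worst observed lag: {worst_lag(samples)}")
```

The point of sampling throughout the day is that a single check at a quiet hour would report compliance while the peak-hour gap goes unnoticed.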

2.      “Force Logging” being disabled for a primary Oracle database 

Enabling “Force Logging” is one of many Oracle best practices for Data Guard environments.

A few words about Force Logging: Oracle provides a means of forcing the writing of redo records for changes against the database, even where NOLOGGING has been specified in DDL statements. Any unlogged operation would invalidate the standby database and would require substantial DBA intervention to propagate the unlogged changes manually.

3.      The archiver of an Oracle instance is stopped

On the primary database, Data Guard uses an archiver process to collect transaction redo data and transmit it to standby destinations. The archiver is a key process in a Data Guard environment. Without it, synchronization will not take place. Every now and then, DBAs do some maintenance work and stop the archiver process but forget to bring it back online. It’s only human to make such mistakes from time to time, and there’s nothing you can do to avoid them.

4.      Critical Primary-Standby OS Configuration Differences

Differences in the configuration of key kernel parameters (open files, semaphores, shared memory, threads) could result either in failure to start the instance on the standby server after a failover, or in the instance providing an “unexplained” poor service level (stability, performance).
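A comparison like this is easy to sketch once the parameter values have been collected from both servers. The example below uses Linux sysctl-style names with made-up values; the `diff_kernel_params` helper and the list of keys to watch are assumptions for illustration.

```python
from typing import Dict, Iterable, Optional, Tuple


def diff_kernel_params(
    primary: Dict[str, str],
    standby: Dict[str, str],
    keys: Iterable[str],
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {parameter: (primary value, standby value)} for every mismatch."""
    diffs = {}
    for key in keys:
        p_value, s_value = primary.get(key), standby.get(key)
        if p_value != s_value:  # also catches a parameter missing on one side
            diffs[key] = (p_value, s_value)
    return diffs


# Hypothetical values gathered from each server.
primary = {"fs.file-max": "6815744", "kernel.sem": "250 32000 100 128",
           "kernel.shmmax": "68719476736"}
standby = {"fs.file-max": "65536", "kernel.sem": "250 32000 100 128"}

watched = ["fs.file-max", "kernel.sem", "kernel.shmmax"]
differences = diff_kernel_params(primary, standby, watched)
for name, (p_value, s_value) in sorted(differences.items()):
    print(f"{name}: primary={p_value} standby={s_value}")
```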

Thousands of things can go wrong every day without you even knowing about it. Testing some of the systems once in a while is hardly enough. The only viable way is to automate DR verification with a proper tool: a tool that will perform a daily read-only scan of your infrastructure and verify that availability and protection levels are high and that no new risks have emerged, and that can handle all the above examples and much more. Come visit us at www.continuitysoftware.com and check out our risk-free 48-hour pilot for RecoverGuard.


Quick Guidebook: Top 10 Private Cloud Risks

March 10, 2011

Enterprises routinely build Disaster Recovery and High Availability measures into their private cloud infrastructure, so why do downtime and data loss risks still exist?

The reality is that even the most robust Disaster Recovery and High Availability (DR/HA) plans are only as good as your ability to test them.

To help you out, we have assembled a community-driven database of over 4,000 issues that pose downtime and data loss risks. While we would love to share them all, you can start with a peek at ten top risks in the private cloud environment.

http://www.continuitysoftware.com/downloads/Top10PrivateCloudRisks.pdf

Yahav Adorian
Continuity Software


VMware ESX: data loss / downtime risks and how to avoid them

December 23, 2010

Continuing my previous virtualization posts, I’d like to take this time to describe additional examples of what-could-go-wrong-in-my-private-cloud. Here they are:

  1. Replication issues. For instance, a VM stored on an un-replicated LUN (or a partially replicated set of LUNs). Or maybe it is replicated, but the last synchronization was done months ago? Perhaps replication was turned off for maintenance and never brought back online…
  2. SAN I/O multipath issues. In this category you may find issues such as dead I/O paths, paths configured with incorrect I/O policies, an insufficient number of paths or an unequal number of paths between nodes. All these and more could result in suboptimal VM/ESX operation and performance and in reduced availability (in other words, more downtime). By the way, see vSphere 4.0’s release notes about the use of the Round-Robin algorithm for I/O load balancing…
  3. Configuration drift between clustered ESX servers. Over time, differences between the nodes may arise in hardware, software, patches, network configuration and so on. These differences would result in different levels of stability, availability and performance depending on the node on which the VM is currently running.
  4. Image consistency (aka point-in-time copies). Specific solutions and/or procedures must be applied to guarantee that a snapshot taken of a VM is consistent and usable. This may include different I/O freeze techniques, such as Oracle hot backup, VM suspension, storage consistency groups and so on.
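To give a feel for what an automated check of the multipath category could look like, the sketch below scans a per-node inventory of LUN paths for dead paths, too few active paths and unequal path counts between nodes. The inventory format, the host and LUN names and the `check_paths` helper are hypothetical, and the minimum of two active paths is an assumed policy.

```python
from typing import Dict, List

# Inventory shape: node -> LUN -> list of path states ("active" or "dead").
PathInventory = Dict[str, Dict[str, List[str]]]


def check_paths(paths: PathInventory, min_paths: int = 2) -> List[str]:
    """Return human-readable findings for common multipath problems."""
    findings = []
    for node, luns in sorted(paths.items()):
        for lun, states in sorted(luns.items()):
            active = sum(1 for state in states if state == "active")
            dead = len(states) - active
            if dead:
                findings.append(f"{node}/{lun}: {dead} dead path(s)")
            if active < min_paths:
                findings.append(
                    f"{node}/{lun}: only {active} active path(s), "
                    f"expected at least {min_paths}"
                )
    # Compare the number of configured paths per LUN across nodes.
    lun_names = set().union(*(luns.keys() for luns in paths.values()))
    for lun in sorted(lun_names):
        counts = {node: len(luns[lun]) for node, luns in paths.items() if lun in luns}
        if len(set(counts.values())) > 1:
            findings.append(f"{lun}: unequal path counts across nodes {counts}")
    return findings


# Hypothetical inventory for two clustered hosts.
inventory = {
    "esx-1": {"lun0": ["active", "active"], "lun1": ["active", "dead"]},
    "esx-2": {"lun0": ["active", "active"],
              "lun1": ["active", "active", "active", "active"]},
}

findings = check_paths(inventory)
for finding in findings:
    print(finding)
```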

The award-winning RecoverGuard by Continuity Software is a solution that can help you identify and report these configuration errors as soon as they occur, thus dramatically decreasing the frequency of downtime events, reducing the amount of work around DR testing and significantly improving recoverability.


Does Keeping Your Resume Up to Date Count as a Valid HA and DR Strategy?

August 25, 2010

By Gil Hecht, Founder and CEO, Continuity Software

Again and again, we are reminded of just how business critical high availability (HA) and disaster recovery (DR) capabilities are in today’s highly-competitive, ultra-demanding economy. Yet, an unreasonably high percentage of today’s most well-known and respected business organizations are leaving themselves vulnerable to both natural and manmade IT disasters.

One needs only to review recent headlines to see what I mean. For instance, the 7-hour outage at Singapore’s largest banking network (“Global CIO: IBM’s Bank Outage: Anatomy of a Disaster”) and the American Eagle Outfitters 8-day-long disaster (“Oracle Backup Failure Major Factor in American Eagle 8-Day Crash”).

Clearly, HA and DR is a persistent challenge for many data centers, regardless of industry or size. And, while most data centers have implemented an HA and/or DR strategy, most understand there is no guarantee it will actually deliver. Due to the time and expense involved, it usually gets tested once or twice a year. Then, over the following days, weeks and months, changes are made to the production environment that are not replicated appropriately, and the HA/DR strategy is rendered virtually useless.

On a daily basis, I meet with many extremely experienced and talented data center managers to talk about how to ensure their organization’s HA and DR. Many do privately admit that while HA and DR is a high business priority – from both an internal governance and/or external legal regulations standpoint – they recognize that if they were to experience a true disaster, data and application availability would probably be lost for an amount of time that far exceeds SLA guidelines (if not permanently). In fact, one IT executive joked that his DR strategy was to, “Keep my resume up to date.”

OK, just for the sake of argument… How about an affordable and easy to manage solution that mitigates data protection and high availability risks by detecting gaps and vulnerabilities between your primary production, HA cluster and/or remote DR sites?


Cutting HA/DR costs with analytics

July 12, 2010

Availability, recoverability and data protection are critical to any enterprise. The cost of the alternative, in lost capital and reputation, is unacceptable. Thus, significant time, resources and money are allocated to ensure that all business lines are highly available and can be recovered under various circumstances (from accidental file deletion to an earthquake). However, the datacenter is constantly changing, and despite the huge effort and world-class IT experts, new risks emerge on a regular basis. Traditional mitigation approaches fall short and regularly cost organizations in downtime, potential data loss and excessive operational effort, because:
• Discrete data is gathered by discrete systems, but none correlates all required layers: storage, OS, database, replication, clustering, etc.
• Home-grown data collection and correlation is economically irrational

Using an HA/DR analytics solution, organizations can, on the one hand, dramatically increase availability and recoverability levels and, on the other hand, save significant time and money. An HA/DR analytics solution such as RecoverGuard by Continuity Software or DRA by Symantec analyzes thousands of potential risks by correlating the configuration of applications, databases, file systems, servers, storage, replication, clustering and “what’s between”. It is updated regularly, and many like to think of it as an “anti-virus for HA/DR”.

So how can an HA/DR analytics solution help cut down HA/DR costs? It’s simple, really. Most enterprises are overspending in the following three direct-cost areas:
• Cost of avoidable downtime
• DR testing operations expense
• HA/DR related sub-optimal resource utilization

Let’s explore each of these cost areas.

Cost of avoidable downtime. Unsuccessful cluster failover, single points of failure in the storage network or multipath setup, RAID level issues, risky layout of database files, suboptimal configuration of database vs. file systems vs. storage leading to unacceptable performance… all these and much more can be avoided by deploying an HA/DR analytics solution. If an hour of downtime costs 100K, a very serious cost reduction opportunity lies here.

DR testing operations expense. The organization can become aware of its DR readiness before actually performing the test and failing over. Thus, a DR drill is executed only after known recoverability issues have been resolved, which spares significant time and resources. Furthermore, identifying new threats on the spot, as they emerge, keeps resolution time and the manpower involved to a minimum. Last night’s changes are still fresh and identifying the root cause is easy, unlike when an error occurs in a yearly DR drill and no one remembers the specific change (one of many…) that was performed months before and created the error. Moreover, with the reporting features and in-depth visibility into dependencies between production and DR systems, replication, cluster configuration and so on, a DR test requires fewer resources from the various IT teams and much less manual labor from the BCP personnel.

HA/DR related sub-optimal resource utilization. While it is not the main purpose of HA/DR analytics, the data gathered by such tools allows them to identify saving opportunities. Examples of storage savings are allocated but unused devices, old replicas, file systems or raw devices allocated to a database but hardly used, and so on. On the storage network side, replication bandwidth can be optimized by detecting excessive replication, swap replication, temp database replication and so on. Naturally, hidden saving opportunities exist in other layers as well.

HA/DR readiness verification is too complex a task to be performed manually without the right tools for the job (check out my “BCP is not different than other IT departments” post). Considering the different teams involved, the different layers, products and vendors, and the endless details embedded within each unique component, doing it manually is practically mission impossible. You know when a DR test starts but you don’t really know when it is going to end… and you don’t know whether data is recoverable at any given time or when the next downtime event will hit. Automation is the key to success and significant cost reduction. HA/DR analytics solutions can dramatically increase control over HA/DR readiness and at the same time reduce the costs considerably.


BCP is not different than other IT departments

May 20, 2010

Among other things, BCP personnel bear the responsibility for recoverability in case of disaster. The BCP manager must verify that at any given time, data can be recovered and operations can be resumed successfully according to the policies (RPO, RTO) set by the organization. This is all fine and dandy in theory, but take a moment to think about it: how exactly will a BCP manager determine recoverability status at any given time?

At best, a DR test is performed every quarter. Suppose you were responsible for the IT recoverability of a large financial institution and that DR exercises are performed on January 1st, April 1st and so on. What would you tell the CIO should he ask you on February 15 if IT is recoverable? Would you be able to answer confidently “Yes”? The honest and correct answer would be “Sir, I do not know. We fixed the glitches found on the Jan 1 DR test so I guess IT was recoverable back then… But now, I cannot say for sure. Probably not.” It gets worse, right? Sure it does. Further deliberation will expose other weak spots in DR testing that we all experience: the DR test included only a small portion of IT… not all critical systems… production wasn’t really shut down or cut off during the test… it didn’t really simulate end users (or load scenarios)… I can go on and on. And so, the question remains: how will BCP evaluate readiness for DR at all times?

Let’s compare notes with other datacenter departments. How does Network Security know that the network is secured? How does a system administrator know that a server is malfunctioning? The answer is simple: they have visibility into their domain. In other words, they have the tools that allow them to explore their area of responsibility, get an up-to-date detailed status and receive automatic notification when something goes wrong. BCP, like any other IT department, must have the right tools for the job. Yet unlike system administrators, database admins and so on, BCP needs a management solution that provides visibility into all IT layers, not just servers or just database configurations. Furthermore, DRM solutions must be capable of analyzing the dependencies between the different layers and finding recovery vulnerabilities.

Can you imagine any organization with a 7+ figure IT budget not purchasing a server performance solution such as HP Performance Manager or IBM Tivoli Monitoring? Or network monitoring and event management solutions such as CA Spectrum/eHealth or HP NNM? Of course not, because it’s clear that datacenter monitoring requires automation (it is too complex a task to be performed continuously and accurately by human beings) and that without automated monitoring, suboptimal operation and downtime are unavoidable. BCP/recovery management is no different. Without a DRM solution, the BCP personnel are “blind” and unaware of datacenter status in terms of readiness for recovery. They must put their trust and faith in the hands of IT teams whose first priority is production. Those teams might be kind enough to share some technical details with the BCP team… but a working datacenter is not based on mere kindness and “favors” but on intelligent processes that lead to efficient, goal-oriented teamwork.

The good news is that high-end DRM solutions have emerged in the last few years, giving BCP personnel just the tools they were missing. Products such as RecoverGuard by Continuity Software and Disaster Recovery Advisor by Symantec provide BCP staff with a real-time, business-oriented status of readiness for disaster (including both HA and DR). These analytics tools automatically identify hidden HA/DR risks and let the user know about them as soon as they arise. They also let users explore the different IT layers and understand dependencies between production databases, servers, storage, remote storage, DR servers and so on. If you are thinking about deploying a DRM solution, note that Continuity Software offers a risk-free 48-hour RecoverGuard pilot.

To guarantee successful recovery 365 days a year, BCP/recovery personnel must have solutions that provide visibility all across the datacenter, and that automatically and continuously analyze datacenter configuration to ensure no recovery gaps and vulnerabilities exist. Such DRM solutions have grown in the past few years to be an integral part of every large IT organization, as it became apparent that with such solutions, significant downtime and loss of data can be avoided.

I, for one, believe it is a major milestone in the everlasting struggle to control HA/DR.

