Want to know how your service availability metrics and DR practices compare to your industry peers?
A new survey conducted by Continuity Software will help you find out.
Download the survey results and get answers to questions such as:
- How does your availability performance compare to your industry peers?
- What were the most common causes for outages in 2011?
- How often are DR tests conducted? How reliable are they?
- How do organizations manage DR on the cloud?
- And many more…
Today’s Topic: Cluster Shared SAN Configuration Drift
The most common way to share data between cluster nodes is through the use of multi-homed SAN storage. Inconsistent access to the SAN volumes by cluster nodes is a state in which one or more shared volumes are not mapped to one or more nodes.
Sharing is intended to guarantee immediate data availability in case of a failover, but inconsistent mapping might put failover in jeopardy.
Why Does It Happen?
The initial configuration of a cluster is typically correct. However, routine configuration changes such as adding a new storage volume or extending the cluster to additional nodes could gradually result in a configuration drift that leaves one or more shared volumes un-mapped to some of the nodes.
What Is the Impact?
In the event of a cluster failover to the passive node, data stored on an un-mapped volume will not be available, leading to downtime of any application that requires access to a database or files stored on these volumes.
How Can It Be Avoided?
There are multiple ways to minimize the risk of such configuration drift:
1. Documentation: Put in place clear and well-documented procedures for any changes introduced to the cluster configuration.
2. Training: Conduct periodic training for all involved personnel to review possible availability risks introduced by production environment modifications.
3. Automation: Implement automated auditing of your high availability environment to ensure passive node configuration is always consistent with active node configuration.
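The automation idea in point 3 can be sketched as a simple consistency check: given the set of SAN volumes each node actually sees, flag any volume that is not mapped to every node. The inventory below is a hypothetical hard-coded sample; in a real environment the per-node volume lists would come from your multipath or storage-array tooling.

```python
# Sketch of a shared-volume consistency audit (hypothetical inventory data).

def find_unmapped_volumes(node_volumes):
    """Return {volume: [nodes missing it]} for every inconsistently mapped volume."""
    all_volumes = set().union(*node_volumes.values())
    drift = {}
    for vol in sorted(all_volumes):
        missing = [n for n, vols in node_volumes.items() if vol not in vols]
        if missing:
            drift[vol] = sorted(missing)
    return drift

inventory = {
    "node1": {"LUN01", "LUN02", "LUN03"},
    "node2": {"LUN01", "LUN02"},  # LUN03 was never mapped here -- classic drift
}

for vol, nodes in find_unmapped_volumes(inventory).items():
    print(f"{vol} is not mapped to: {', '.join(nodes)}")
```

Run daily against a fresh inventory, a check like this catches the drift long before a failover exposes it.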
Learn more about Automated Daily High Availability Testing
Today’s Hidden Risk: Invalid Database Snapshot
An invalid database copy can be created when a storage-based snapshot (or any other type of copy, such as replica, clone, etc.) of the database is taken while the database is not in “hot backup” mode. In this case, the copy may be unusable when a restore is attempted.
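The safe ordering described above can be sketched as a small orchestration routine: enter hot-backup mode, take the storage snapshot, and always leave hot-backup mode, even if the snapshot fails. The `run_sql` and `take_storage_snapshot` callables are hypothetical placeholders for your database client and array CLI; the SQL statements are the standard Oracle hot-backup bracket.

```python
# Sketch of snapshot orchestration that keeps the database copy restorable.
# run_sql and take_storage_snapshot are hypothetical placeholders injected
# by the caller (database client, storage-array CLI).

def consistent_snapshot(run_sql, take_storage_snapshot):
    run_sql("ALTER DATABASE BEGIN BACKUP")       # put datafiles in hot-backup mode
    try:
        snapshot_id = take_storage_snapshot()    # storage-level copy is now usable for restore
    finally:
        run_sql("ALTER DATABASE END BACKUP")     # always leave backup mode, even on error
    return snapshot_id
```

The `try/finally` is the important part: a snapshot taken outside the BEGIN/END BACKUP bracket is exactly the invalid copy this risk describes.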
- How can an invalid database snapshot occur in your environment?
- Why will it come back to bite you?
- What will be the impact on your environment?
- How can you avoid the risk?
This risk is one of 5,000+ downtime and data loss vulnerabilities identified in over one hundred of the largest data centers worldwide included in Continuity Software’s community-sourced knowledgebase.
Read more now for free tips that will help you avoid this risk
Based on data gathered from 88 organizations worldwide, the Vulnerability Index Benchmark by Continuity Software provides a first-of-its-kind measurement of downtime and data loss risk for each organization, grouped by industry sector.
Get your complimentary copy of the 2011 Vulnerability Index Benchmark study and find out:
- Which areas of the IT infrastructure present the greatest HA/DR risks?
- Which industry sectors exhibit the highest levels of downtime and data loss risks?
- What types of HA/DR risks are most common in each sector?
- How can you compare your risk to your industry peers?
There has been a flurry of recent reports regarding outages leading to downtime and data loss. Here is just a small sample:
- This article in USA Today is titled “United’s flight mess latest caused by computer glitches,” but it also covers similar outages at Alaska Airlines, Southwest Airlines, Spirit Airlines, and the Dutch city of Maastricht.
- An Exchange online outage at no less than Microsoft.
- ‘Signal storm’ caused Telenor outages: 3 million users left with no service for 18 hours.
- TDS Telecom restores service in Hancock County: the firm’s office in Arcadia, Ohio was struck by lightning (this can happen anywhere, really).
Clearly all the above invest heavily in HA/DR solutions, so why were they caught unprepared when an outage occurred?
We’ve addressed this question in previous posts such as Why DR testing doesn’t work, Think that little config change is minor? Think again, and BCP is not different than other IT departments. The short answer is that today’s datacenters are too complex and dynamic to rely on periodic tests or audits.
Conclusion: real-time verification of HA/DR readiness is required.
How can this be accomplished? Considering the amount of labor put into a single audit/test, automation is a must. Disaster Recovery Management solutions such as RecoverGuard and DR Advisor can help you achieve the following benefits:
- Continuous non-intrusive end-to-end infrastructure scan for 24×7 visibility into DR risks
- HA/DR risk detection and readiness of passive systems for failover
- Ability to measure the de-facto RPO and other HA/DR SLA metrics
- Comprehensive HA/DR readiness reporting
To learn how the best datacenters in the world accomplish this feat, check out our latest webinar.
Your company decided to replicate production Oracle databases to a remote DR site using Data Guard. The database administration crew set it up and ran several tests to ensure that it works properly. How do you know that it will also work properly tomorrow, or the day after? Many things can go wrong, rendering the standby Oracle database unfit for recovery. Some risks are not specifically related to Oracle configuration; others may be very subtle and difficult to identify manually. The standby database may fail to start when you need it the most. Or maybe it will start, but some of the data will be missing (ouch!). Or maybe it’ll “just” perform very badly after the failover and cause service disruption. I’ve selected four examples of common Oracle Data Guard vulnerabilities to share with you. Obviously there are thousands of risks that may affect the availability and recoverability of an Oracle database in Data Guard mode. So, here they are:
1. Standby database not synchronized with its primary Oracle database
Like the vast majority of companies that choose Data Guard, your company probably decided to set it up in the default “MAX PERFORMANCE” mode, which basically puts the performance of the source database as first priority and standby synchronization only as second priority. Redo logs are written asynchronously to the standby database, and if there are delays, the standby database falls behind. It’s reasonable to assume that during rush hours the gap between the source and standby databases would be at its largest. If a failure occurred during this time, a significant amount of data could be lost. BCP manager – how would you know whether Data Guard synchronization complies with your RPO goal? You don’t!
Of course, there are many other reasons for the standby Oracle database to fall behind the source database: network issues causing heartbeat failures, storage configuration problems on the standby servers, and more.
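One way to answer the RPO question is to compare the standby’s measured lag against your RPO goal. A minimal sketch, assuming the lag is read from the standby (for example, from a V$DATAGUARD_STATS lag row, whose value uses Oracle’s “+DD HH:MM:SS” interval format); here the value is a hard-coded sample rather than a live query.

```python
# Sketch of an RPO compliance check against a Data Guard standby.
# In practice transport_lag would be queried from the standby database;
# here it is a hard-coded sample in Oracle's "+DD HH:MM:SS" interval format.

def lag_seconds(interval):
    """Parse an Oracle '+DD HH:MM:SS' day-to-second interval into seconds."""
    days, hms = interval.lstrip("+").split()
    h, m, s = (int(x) for x in hms.split(":"))
    return int(days) * 86400 + h * 3600 + m * 60 + s

RPO_SECONDS = 300                      # e.g. a 5-minute RPO goal
transport_lag = "+00 00:12:47"         # sample value; would be collected live

if lag_seconds(transport_lag) > RPO_SECONDS:
    print("ALERT: standby lag exceeds the RPO goal")
```

Checked daily (ideally during peak hours, when the gap is largest), this turns the rhetorical “You don’t!” into a measurable yes or no.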
2. “Force Logging” being disabled for a primary Oracle database
Enabling “Force Logging” is one of many Oracle best practices for Data Guard environments.
A few words about Force Logging: Oracle provides a means of forcing the writing of redo records for changes against the database, even where NOLOGGING has been specified in DDL statements. Any unlogged operations would invalidate the standby database and would require substantial DBA intervention to manually propagate them.
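Auditing this setting can be sketched as a one-query check. The `fetch_one` callable below is a hypothetical placeholder for your database client; the query against V$DATABASE is standard Oracle.

```python
# Sketch of a Force Logging audit. fetch_one is a hypothetical placeholder
# for whatever database client you use; it should return the single value
# of the query.

def check_force_logging(fetch_one):
    value = fetch_one("SELECT force_logging FROM v$database")
    if value != "YES":
        return ("RISK: Force Logging is disabled; "
                "NOLOGGING operations would invalidate the standby")
    return "OK"
```

Because the setting can be flipped at any time by a well-meaning DBA chasing load performance, this is a check worth repeating daily rather than verifying once at setup.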
3. The archiver of an Oracle instance is stopped
On the primary database, Data Guard uses an archiver process to collect transaction redo data and transmit it to standby destinations. The archiver is a key process in a Data Guard environment; without it, synchronization will not take place. Every now and then DBAs do some maintenance work, stop the archiver process, but forget to bring it back online. It’s only human to make such mistakes from time to time, and there’s nothing you can do to avoid them entirely.
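What you can do is detect the mistake quickly. A minimal sketch: given process/status rows from the primary (for example, from a view such as V$ARCHIVE_PROCESSES), flag any archiver that is no longer running. The sample rows and the “which statuses are bad” rule here are illustrative assumptions.

```python
# Sketch of an archiver health check. The rows would come from the primary
# database (e.g. process name and status per archiver); here they are a
# hard-coded sample, and the status values treated as bad are an assumption.

def stopped_archivers(rows):
    """Return the names of archive processes that are not running."""
    return [proc for proc, status in rows if status in ("STOPPED", "TERMINATED")]

sample = [("ARC0", "ACTIVE"), ("ARC1", "STOPPED")]

for proc in stopped_archivers(sample):
    print(f"ALERT: archiver {proc} is down; standby synchronization has halted")
```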
4. Critical Primary-Standby OS Configuration Differences
Differences in the configuration of key kernel parameters (open files, semaphores, shared memory, threads) would result either in failure to start the instance on the standby server in case of failover, or in the instance providing an “unexplained” poor service level (stability, performance).
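Comparing the two hosts is easy to automate. A minimal sketch, assuming the parameter values have been collected from each server (for example, via sysctl output); the parameter names and values below are hypothetical samples.

```python
# Sketch of a primary-vs-standby kernel parameter comparison. The dicts
# would normally be collected from each host (e.g. from sysctl output);
# here they are hypothetical samples.

KEY_PARAMS = ["fs.file-max", "kernel.sem", "kernel.shmmax"]

def config_diffs(primary, standby, keys=KEY_PARAMS):
    """Return {param: (primary_value, standby_value)} for every mismatch."""
    return {k: (primary.get(k), standby.get(k))
            for k in keys if primary.get(k) != standby.get(k)}

primary = {"fs.file-max": "6815744", "kernel.shmmax": "68719476736",
           "kernel.sem": "250 32000 100 128"}
standby = {"fs.file-max": "6815744", "kernel.shmmax": "4294967296",
           "kernel.sem": "250 32000 100 128"}

for param, (p, s) in config_diffs(primary, standby).items():
    print(f"{param}: primary={p} standby={s}")
```

A mismatch like the shared-memory one above is exactly the kind of difference that lets the standby pass a casual inspection yet fail, or crawl, on the day of a real failover.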
Thousands of things can go wrong every day without you even knowing about it. Testing some of the systems once in a while is hardly enough. The only viable way is to automate DR verification with a proper tool: one that performs a daily read-only scan of your infrastructure, guarantees that availability and protection levels remain high, and confirms that no new risks have emerged. A tool that can handle all the above examples and much more. Come visit us at www.continuitysoftware.com and check out our risk-free 48-hour pilot for RecoverGuard.