4 Oracle Data Guard Recovery Risks

July 19, 2011

Your company decided to replicate production Oracle databases to a remote DR site using Data Guard. The database administration crew set it up and ran several tests to ensure that it works properly. How do you know that it will also work properly tomorrow or the day after? Many things can go wrong, rendering the standby Oracle database unfit for recovery. Some risks are not specifically related to Oracle configuration. Others may be very subtle and difficult to identify manually. The standby database may fail to start when you need it the most. Or maybe it will start, but some of the data will be missing (ouch!). Or maybe it’ll “just” perform very badly after the fail-over and cause service disruption. I’ve selected 4 examples of common Oracle Data Guard vulnerabilities to share with you. Obviously, there are thousands of risks that may affect the availability and recoverability of an Oracle database in Data Guard mode. So, here they are:

1.      Standby database not synchronized with its primary Oracle database 

Like the vast majority of companies that choose Data Guard, your company probably set it up in the default “MAX PERFORMANCE” mode, which puts the performance of the source database first and standby synchronization second. Redo data is shipped asynchronously to the standby database, and if there are delays, the standby database falls behind. It’s reasonable to assume that the gap between the source and standby databases is at its highest during peak hours. If a failure occurs during this time, a significant amount of data could be lost. BCP Manager, how would you know whether Data Guard synchronization complies with your RPO goal? You don’t!

Of course, there are many other reasons for the standby Oracle database to fall behind the source database, such as network issues causing heartbeat failures, storage configuration on the standby servers and more.
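One way to turn the RPO question into a routine check is to compare the lag reported by the standby against the RPO target. The sketch below is only an illustration: it assumes the lag arrives as a day-to-second interval string such as "+00 00:07:30" (the form I would expect from the VALUE column of V$DATAGUARD_STATS on the standby; verify this against your Oracle version). The function names and the 5-minute RPO in the usage note are hypothetical.

```python
import re
from datetime import timedelta

# Assumed input format: "+DD HH:MM:SS", for example "+00 00:07:30".
_LAG_PATTERN = re.compile(r"^\+?(\d+) (\d{1,2}):(\d{2}):(\d{2})$")


def parse_lag(value: str) -> timedelta:
    """Convert a day-to-second interval string into a timedelta."""
    match = _LAG_PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"unrecognized lag format: {value!r}")
    days, hours, minutes, seconds = (int(part) for part in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def violates_rpo(lag_value: str, rpo: timedelta) -> bool:
    """Return True when the reported lag is larger than the RPO target."""
    return parse_lag(lag_value) > rpo
```

For example, violates_rpo("+00 00:07:30", timedelta(minutes=5)) returns True. Run on a schedule that includes peak hours, a check like this shows whether the gap stays within the RPO when it matters most.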

2.      “Force Logging” being disabled for a primary Oracle database 

Enabling “Force Logging” is one of many Oracle best practices for Data Guard environments.

A few words about Force Logging: Oracle provides a means of forcing the writing of redo records for changes against the database, even where NOLOGGING has been specified in DDL statements. Without it, any un-logged operations would invalidate the standby database and would require substantial DBA intervention in order to manually propagate them.
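A minimal sketch of how such a check could be automated, assuming the force-logging flag has already been read from each primary database (Oracle exposes it as the FORCE_LOGGING column of V$DATABASE) and collected into a plain dictionary. The database names are made up, and treating every value other than "YES" as a risk is a simplification.

```python
def databases_without_force_logging(states: dict[str, str]) -> list[str]:
    """Return the names of databases whose force-logging flag is not 'YES'.

    `states` maps a database name to the flag value reported for it.
    """
    return sorted(
        name for name, flag in states.items()
        if flag.strip().upper() != "YES"
    )
```

For example, databases_without_force_logging({"ORDERS": "YES", "BILLING": "NO"}) returns ["BILLING"].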

3.      The archiver of an Oracle instance is stopped

On the primary database, Data Guard uses an archiver process to collect transaction redo data and transmit it to standby destinations. The archiver is a key process in a Data Guard environment. Without it, synchronization will not take place. Every now and then DBAs do some maintenance work and stop the archiver process, but forget to bring it back online. It’s only human to make such mistakes from time to time, and there’s nothing you can do to avoid them.

4.      Critical Primary-Standby OS Configuration Differences

Differences in the configuration of key kernel parameters (open files, semaphores, shared memory, threads) would result either in failure to start the instance on the standby server after a fail-over or in the instance providing an “unexplained” poor service level (stability, performance).
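Detecting this kind of difference comes down to comparing two sets of settings. The sketch below assumes the kernel parameters of each server have already been collected into dictionaries (for instance from sysctl output on Linux). The parameter names and values in the usage note are illustrative only, and the function name is hypothetical.

```python
def diff_kernel_params(primary: dict[str, str], standby: dict[str, str]) -> dict[str, tuple]:
    """Map each differing parameter to its (primary, standby) values.

    A parameter missing on one side is reported with None for that side.
    """
    differences = {}
    for name in sorted(set(primary) | set(standby)):
        primary_value = primary.get(name)
        standby_value = standby.get(name)
        if primary_value != standby_value:
            differences[name] = (primary_value, standby_value)
    return differences
```

Given a primary with fs.file-max set to "6815744" and a standby with "65536", and all other parameters equal, the result is {"fs.file-max": ("6815744", "65536")}. Reporting missing parameters as None also catches settings that were tuned on one server and never set on the other.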

Thousands of things can go wrong every day without you even knowing about it. Testing some of the systems once in a while is hardly enough. The only viable way is to automate DR verification with a proper tool. A tool that will perform a daily read-only scan of your infrastructure and verify that availability and protection levels are high, and that no new risks have emerged. A tool that can handle all the above examples and much more. Come visit us at www.continuitysoftware.com and check out our risk-free 48-hour pilot for RecoverGuard.


VMware ESX: data loss / downtime risks and how to avoid them

December 23, 2010

Continuing my previous virtualization posts, I’d like to take this time to describe additional examples of what-could-go-wrong-in-my-private-cloud. Here they are:

  1. Replication issues. For instance, a VM which is stored on an un-replicated LUN (or a partially replicated set of LUNs). Or maybe it is replicated, but the last synchronization was done months ago because replication was turned off for maintenance and never brought back online.
  2. SAN I/O multipath issues. In this category you may find issues such as dead I/O paths, paths configured with incorrect I/O policies, an insufficient number of paths or an unequal number of paths between nodes. All these and more could result in suboptimal VM/ESX operation and performance and in reduced availability (in other words, more downtime). By the way, see the vSphere 4.0 release notes about the use of the Round-Robin algorithm for I/O load balancing…
  3. Configuration drift between clustered ESX servers. Over time, differences may arise between the nodes in hardware, software, patches, network settings and so on. These differences would result in different levels of stability, availability and performance depending on the node on which the VM is currently running.
  4. Image consistency (aka point-in-time copies). Specific solutions and/or procedures must be applied in order to guarantee that a snapshot taken of a VM is consistent and usable. This may include different techniques of I/O freeze, such as Oracle hot backup, VM suspension, storage consistency groups and so on.
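The first item in the list above lends itself to a simple automated check. As a rough sketch, assume you already have a mapping from each VM to the LUNs it is stored on, plus the set of LUNs that are currently replicated. Both inputs, and all the names used here, are hypothetical. Note that this says nothing about how recent the last synchronization was.

```python
def replication_coverage(vm_luns: dict[str, set[str]], replicated: set[str]) -> dict[str, str]:
    """Classify each VM as 'full', 'partial' or 'none' by how many of its LUNs are replicated."""
    coverage = {}
    for vm, luns in vm_luns.items():
        covered = luns & replicated
        if luns and covered == luns:
            coverage[vm] = "full"
        elif covered:
            coverage[vm] = "partial"
        else:
            coverage[vm] = "none"
    return coverage
```

Any VM reported as "partial" or "none" would not be fully recoverable at the DR site, even though replication may look healthy at the storage level.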

The award-winning RecoverGuard by Continuity Software is a solution that can help you identify and report these configuration errors as soon as they occur, thus dramatically decreasing the frequency of downtime events, reducing the amount of work around DR testing and significantly improving recoverability.


Does Keeping Your Resume Up to Date Count as a Valid HA and DR Strategy?

August 25, 2010

By Gil Hecht, Founder and CEO, Continuity Software

Again and again, we are reminded of just how business critical high availability (HA) and disaster recovery (DR) capabilities are in today’s highly-competitive, ultra-demanding economy. Yet, an unreasonably high percentage of today’s most well-known and respected business organizations are leaving themselves vulnerable to both natural and manmade IT disasters.

One needs only to review recent headlines to see what I mean. For instance, the 7-hour outage at Singapore’s largest banking network (“Global CIO: IBM’s Bank Outage: Anatomy of a Disaster”) and the American Eagle Outfitters 8-day-long disaster (“Oracle Backup Failure Major Factor in American Eagle 8-Day Crash”).

Clearly, HA and DR is a persistent challenge for many data centers, regardless of industry or size. And, while most data centers have implemented an HA and/or DR strategy, most understand there is no guarantee it will actually deliver. Due to the time and expense involved, it usually gets tested once or twice a year. Then, over the following days, weeks and months, changes are made to the production environment that are not replicated appropriately, and the HA/DR strategy is rendered virtually useless.

On a daily basis, I meet with many extremely experienced and talented data center managers to talk about how to ensure their organization’s HA and DR. Many privately admit that while HA and DR is a high business priority, from an internal governance and/or external regulatory standpoint, they recognize that if they were to experience a true disaster, data and application availability would probably be lost for an amount of time that far exceeds SLA guidelines (if not permanently). In fact, one IT executive joked that his DR strategy was to “keep my resume up to date.”

OK, just for the sake of argument… How about an affordable and easy to manage solution that mitigates data protection and high availability risks by detecting gaps and vulnerabilities between your primary production, HA cluster and/or remote DR sites?


Cutting HA/DR costs with analytics

July 12, 2010

Availability, Recoverability and Data Protection are critical to any enterprise. The cost of the alternative, in lost capital and reputation, is unacceptable. Thus, significant time, resources and money are allocated to ensure that all business lines are highly available and can be recovered under various circumstances (from accidental file deletion to an earthquake). However, the datacenter is constantly changing, and despite the huge effort and world-class IT experts, new risks emerge on a regular basis. Traditional mitigation approaches fall short and regularly cost the organization in downtime, potential data loss and excessive operational effort, since:
• Discrete data is gathered by discrete systems, but none correlates all required layers: storage, OS, Database, replication, clustering, etc.
• Home-grown data collection and correlation is economically irrational

Using an HA/DR analytics solution, organizations can dramatically increase availability and recoverability levels on the one hand and save significant time and money on the other. An HA/DR analytics solution such as RecoverGuard by Continuity Software or DRA by Symantec analyzes thousands of potential risks by correlating the configuration of applications, databases, file systems, servers, storage, replication, clustering and “what’s between”. It is updated regularly, and many like to think of it as an “anti-virus for HA/DR”.

So how can an HA/DR analytics solution help cut down HA/DR costs? It’s simple really. Most enterprises are overspending in the following three direct-cost areas:
• Cost of avoidable downtime
• DR testing operations expense
• HA/DR related sub-optimal resource utilization

Let’s explore each of these cost areas.
Cost of avoidable downtime. Unsuccessful cluster failover, single points of failure in the storage network/multipath, RAID level issues, risky layout of database files, suboptimal configuration of database vs. file systems vs. storage leading to unacceptable performance… all these and much more can be completely avoided by deploying an HA/DR analytics solution. If an hour of downtime costs 100K, a very serious cost reduction opportunity lies here.
DR testing operations expense. The organization can become aware of its DR readiness before actually performing the test and failing over, and thus execute a DR drill only after resolving known recoverability issues. By doing so, significant time and resources are spared. Furthermore, by identifying new threats on the spot, as they emerge, resolution time and the manpower involved are kept to a minimum. Last night’s changes are still fresh and identifying the root cause is easy, unlike when an error occurs in a yearly DR drill and no one remembers the specific change (one of many…) that was performed months before and created the error. Moreover, with the reporting features and in-depth visibility into dependencies between production and DR systems, replication, cluster configuration (and so on), a DR test requires fewer resources from the various IT teams and much less manual labor from the BCP personnel.
HA/DR related sub-optimal resource utilization. While it is not the main purpose of HA/DR analytics, the data gathered by such tools allows them to identify saving opportunities. Examples of such opportunities around storage are allocated but unused devices, old replicas, file systems or raw devices allocated to a database but hardly used, and so on. On the storage network side, replication bandwidth can be optimized by detecting excessive replication, swap replication, temp database replication and so on. Naturally, hidden saving opportunities exist in other layers as well.

HA/DR readiness verification is too complex a task to be performed manually, without the right tools for the job (check out my “BCP is not different than other IT departments” post). Considering the different teams involved, the different layers, products and vendors, and the endless details embedded within each unique component, doing it manually is practically mission impossible. You know when a DR test starts but you don’t really know when it is going to end… and you don’t know whether data is recoverable at any given time or when the next downtime event will hit. Automation is the key to success and significant cost reduction. HA/DR analytics solutions can dramatically increase control over HA/DR readiness and at the same time reduce the costs considerably.


A common risk identified in remote mirroring configuration

January 10, 2010

Remember our LVM mirroring article from a few weeks ago? This time I’d like to take a closer look at one of the potential risks that were described.

The risk signature:

Incorrect Mirror Configuration for DR


The impact

In any event which requires recovering data from the DR site:

  • Recovery will not be possible
  • Data will be lost
  • RPO SLA will be breached
  • Extended downtime and RTO SLA violation

Technical details

In this scenario, the customer is using mirroring at the LVM (Logical Volume Management) level to create a synchronous copy of the database at the DR site.

The source data is stored on SAN volumes located at the production site, while the mirror is supposed to be stored on SAN volumes at the DR site. However, the configuration is erroneous, since the mirrored data is partially stored on volumes from the production SAN array (See Image 1: Incorrect Mirror Configuration). In the event of a disaster, no complete copy of the database will be available at the DR site. Recovery from the mirror will not be possible. The database will have to be recovered from a recent backup, a process which involves loss of data and RPO violation and, due to the nature of recovering from tape, prolonged downtime and RTO SLA violation.
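The condition described here can be expressed as a simple rule: every mirrored logical volume needs at least one copy whose physical volumes all sit on the DR array. The sketch below assumes the LVM layout has already been collected into two plain mappings. The data structures, the site labels "PROD" and "DR", and the function name are all hypothetical.

```python
def volumes_without_dr_copy(mirror_copies, pv_site, dr_site="DR"):
    """Return logical volumes that have no mirror copy stored entirely at the DR site.

    mirror_copies: logical volume name -> list of copies, each a list of physical volumes.
    pv_site: physical volume name -> site of the array that hosts it.
    """
    at_risk = []
    for lv, copies in mirror_copies.items():
        has_full_dr_copy = any(
            legs and all(pv_site.get(pv) == dr_site for pv in legs)
            for legs in copies
        )
        if not has_full_dr_copy:
            at_risk.append(lv)
    return sorted(at_risk)
```

A volume whose second copy spans one DR disk and one production disk is flagged, which is exactly the partial-mirror case described above. A physical volume with an unknown site is treated as not being at the DR site, so gaps in the collected data err on the side of raising a warning.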

Can it happen to me?

Yes, for various reasons.

First, configuration errors are inevitable in the enterprise datacenter environment, which involves thousands of configuration entities such as arrays, disks, physical volumes, logical volumes and so on…

Moreover, such a vulnerability would go unnoticed until recovery is needed, since the mirrored copy is not put to use on a regular basis. Last, configuration drift builds up over time. Even if the environment was set up correctly in the past, any change applied may endanger the validity of the DR solution. For instance, expanding the database to new file systems and/or SAN volumes may break the DR mirror if the implementer does not take into account the intentions of the original design and its complexity.

Image 1: Incorrect Mirror Configuration

Think your data centers may have hidden recoverability and downtime risks such as this? Find out with the risk-free 48-hour RecoverGuard pilot scan.


Video: “Configuration Drift? No Problem.”

April 17, 2009

If you are interested in learning more about the problems posed by configuration drift, here’s an interesting video from Symantec’s website:

http://www.symantec.com/connect/videos/configuration-drift-no-problem-symantec-introduces-veritas-commandcentral-disaster-recovery-a


Check out this article: Symantec enters DR testing game

April 7, 2009

by Yaniv Valik
SR DR Specialist, DR Assurance Group

I thought you might be interested in reading this InfoStor article about Symantec’s launch of Veritas CommandCentral Disaster Recovery Advisor, which they have OEM’d from Continuity. There were some stats from Symantec research that caught my eye:

30% of DR tests fail
32% of organizations say testing will impact their customers
21% of those polled believe testing could negatively affect sales and revenue

Interesting reading: http://www.infostor.com/index/articles/display/4378232387/s-articles/s-infostor/s-backup-and_recovery/s-disaster-recovery/s-symantec-enters_the.html


What it takes to develop a bullet-proof HA/DR plan

March 5, 2009

by Gil Hecht
CEO

Today, most IT organizations have virtually no ability to accurately measure how well their high availability (HA) and disaster recovery (DR) plans will respond when needed.  As other posters on this blog have noted, this is primarily due to the complexity and scale of their IT infrastructures as well as the limitations of traditional HA and DR methodologies.

But as we look to the future, IT will continue to be under pressure to ensure business continuity and eliminate the possibility of even the most minute data loss.  Anything less than completely bullet-proof plans will no longer be acceptable.

I believe that automated HA and DR testing and monitoring software will help companies compensate for the shortcomings of current methodologies and deliver the bullet-proof protection companies demand. These solutions routinely check for hidden vulnerabilities and identify problems — in both virtualized and non-virtualized environments — that had previously gone undetected, allowing the issues to be addressed before they impact business operations.

The need for this technology is clearly demonstrated, which means it will not be long before companies have a number of solutions from which to choose. So what should an IT manager look for in an ideal solution? Here are some things to consider:

  • Automatic IT Discovery and Scanning - agentless technology used to scan an IT environment and collect information and configuration data from IT assets.
  • Robust Gap Detection – ability to identify the thousands of possible HA and DR gaps and vulnerabilities – including, but not limited to: interdependencies/mapping between virtual and physical layers, data tampering, data completeness, data consistency, as well as host configuration in heterogeneous environments.
  • Visualization and Reports – provide a clear view of the IT infrastructure configuration and status, as well as the discovered HA and DR gaps. Visualization provides all the necessary information to drill-down into a specific IT asset, understand its protection status, and to resolve a potential gap that was discovered.
  • Customizable Reporting Infrastructure - ticket summaries delivered on a preset schedule, via a predetermined delivery protocol.  Effective notification means fast and accurate issue resolution.  The solution should also integrate with existing configuration management databases (CMDBs) and ticket management systems, in order to maintain one point of contact for monitoring and managing the entire environment. 
  • Optimization Discovery - ability to analyze the information previously gathered to identify IT optimization and fine-tuning opportunities in the infrastructure, in order to maximize utilization and extend ROI.

A few DR testing mistakes to avoid

March 2, 2009

by Doron Pinhas
VP, Field Operations

1. Don’t rely on the production ECC server during a DR test. Storage management tools such as ECC are the main tools used by system and storage administrators to understand and configure the relationship between servers and storage devices. It is common practice not to map all replica devices to the DR servers during normal operations. However, if a valid and current DR ECC environment is not maintained, there may be no easy way to tell how to map thousands of unmapped devices to the appropriate DR servers. In the stress and confusion that accompanies disaster events, this may lead to significantly extended recovery time.

2. When running a DR test, many companies will just confirm the application started or, at best, run one or two transactions before returning to production. The danger with this shortcut is that real system usage is not simulated, so it is impossible to determine if there are underlying problems – database dependencies, for example, or the ability to support the true production load – that could have an impact in a real failover event.


Top data storage products of the year

February 26, 2009

In case you missed it, here’s a link to SearchStorage and Storage magazine’s announcement of this year’s Top Data Storage Product of the Year winners and finalists:

http://searchstoragechannel.techtarget.com/generic/0,295582,sid98_gci1349065,00.html#

