Typically, when a bug results in a database outage, affected systems can simply be recovered from backups. But as InfoWorld has learned, this particular collection of Oracle issues could incur database outages that take considerable time and effort to correct.
[ Subscribe to the InfoWorld Daily newsletter and follow InfoWorld on Twitter to make sure you don't miss an article. | See Editor in Chief Eric Knorr's post to the Oracle community: "Caling all Oracle customers." ]
According to a source who asked not to be named, "This is a very real problem for us. We're spending considerable time and money to monitor, plan, and address it where we can."
In the process of reporting this story, we've conducted our own tests, verified information with sources we believe to be reliable, and consulted extensively with Oracle itself, which has given credit to InfoWorld for bringing the security aspect of the problem to the company's attention.
After we notified Oracle of our discoveries and following several technical discussions, the company requested that we hold this story until it had time to develop and test patches that addressed the flaw. In the interest of security for Oracle users, we agreed. Those patches are available at 1 p.m. PT today as part of the Oracle Critical Patch Update for January 2012.
To be clear, the security aspect of the flaw could make any unpatched Oracle Database customer vulnerable to malicious attack. Yet another, more fundamental aspect could pose a special risk only to large Oracle customers with interconnected databases. Both stem from a mechanism deep in the database engine, one that most Oracle DBAs seldom deal with day to day.
The heart of the matterAt the core of the issue is the System Change Number (SCN) in Oracle. This is a number that increments sequentially with every database commit: inserts, updates, and deletes. The SCN is also incremented through linked database interactions.
The SCN is crucial to normal Oracle database operation. The SCN “time stamp” is the key to maintaining data consistency in Oracle, allowing the database to respond to every user’s query with the appropriate version of data at every point in time. It functions as the database's clock, so to speak. And like time, the SCN cannot decrement. It must always tick forward.
When Oracle databases link to each other, maintaining data consistency requires them to synchronize to a common SCN. This is necessarily the highest SCN carried by any participating Oracle database instance because the SCN clock cannot run backward -- so database linking causes the SCN in many databases to jump during normal operations. And only very basic permissions are required to make a connection that can cause one database to increment the SCN on another.
The architects of Oracle's flagship database application must have been well aware the SCN needed to be a massive integer. It is: a 48-bit number (281,474,976,710,656). It would take eons for an Oracle database to eclipse that number of transactions and cause problems -- or so you might think.
In addition to that hard limit, there's a soft limit imposed by Oracle itself to ensure the SCN value at any given moment is not unreasonably high, which would indicate a database malfunction. If the soft limit is exceeded, the database can become unstable and/or unavailable. And because the SCN cannot be decremented or otherwise reset, the database cannot simply be restored from a backup and replaying the logs, since they will necessarily carry the SCN elevation along with those transactions. An analogy can be made to the effect of the Y2K bug on an unpatched system as the clock struck 12 a.m. on Jan. 1, 2000.
The soft limit derives from a very simple calculation anchored to a point in time 24 years ago: Take the number of seconds since 00:00:00 01/01/1988 and multiply that figure by 16,384. If the current SCN value is below that, then all is well and processing continues as normal. To put this in simple terms, the calculation assumes that a database running constantly since 01/01/1988, processing 16,384 transactions per second, cannot exist in reality.
In reality, it doesn't. But if you alter reality, it does -- and there are several ways a breach of the SCN soft limit may occur.
The backup bugOne recent example comes in the form of a bug. Oracle databases have a feature that supports hot backups. This allows a database administrator to run a command to facilitate a backup of a live database. It's a handy function that can be run easily: 'ALTER DATABASE BEGIN BACKUP' is the command you need. You can then back up the live database until you issue an 'ALTER DATABASE END BACKUP' command that returns the database to normal operating mode. This makes it very simple for a DBA to take care of live backups of production databases.
The problem is that, due to a coding flaw, issuing the 'BEGIN BACKUP' command causes the SCN for the database instance to increase dramatically, so that the SCN continues to increase at an accelerated rate even after the 'END BACKUP' command is given. Thus, performing a hot backup can increase the SCN value by millions or billions quite quickly -- and that elevated growth continues unabated. In most cases the SCN limits are so far out of reach that the occasional jump in number isn't a cause for concern. Admins are unlikely even to notice the problem.
The combinationBut when you mix the hot backup bug with large numbers of interconnected databases in a massive Oracle implementation, the combination can result in widespread SCN elevation. Some large Oracle customers have hundreds of database servers running hundreds of Oracle instances interlinked throughout the infrastructure. Each one may be tasked with a core service and a few lesser functions -- but like Kevin Bacon and the rest of Hollywood, nearly all are linked together somehow, through one, two, four, or more intermediary servers.
With all of these servers interconnected, their SCNs become synchronized at one point or another. Collectively they might be pushing more than 16,384 commits per second, but certainly not since 01/01/1988, so the SCN soft limit isn't problematic.
But what if a DBA on one part of this Oracle network runs the flawed hot backup routine Suddenly, the SCN on his Oracle instances shoot up by, say, 700 million, and this number soon becomes the new SCN for all interlinked Oracle instances throughout the organization. Some time later, another hot backup command is issued by a DBA on the other side of the company. The SCN shoots up a few hundred million this time, also synchronized across all connected instances over time.
With the issuance of a few commands, the SCN of the entire Oracle database infrastructure has increased by hundreds of millions or even hundreds of billions in a short period. Even database instances that connect only occasionally, or for a weekly or monthly batch run, might see their SCN numbers jump by trillions.
In such a scenario, it may be only a matter of time before enough backup commands are run to cause the SCN to eclipse the soft limit -- at which point every interlinked Oracle database server suffers some significant problem, refuses connections from other servers, or simply crashes.
Oracle released a patch to fix the arbitrary SCN growth rate bug in the hot backup code before InfoWorld began researching this story. The backup bug is listed as 12371955: "High SCN growth rate from ALTER DATABASE BEGIN BACKUP in 11g." If you have not already done so, Oracle recommends that you install this patch immediately.
The saboteur But the risk of incrementing the SCN via the backup bug is not the only cause for concern. Perhaps the most important part of our finding is that the SCN can be incremented by anyone who can issue commands on an interconnected database.
Say there's a low-privilege reporting database system that has a read-only database link to a high-privilege database somewhere down the line. All someone has to do is issue a few administrative commands on that low-privilege database to raise the SCN value by, for example, 1 trillion; the next time the high-privilege database receives a connection from that box, its SCN increases to the higher level. It's simple enough to verify the current SCN value, so running a command to push it a few ticks shy of the soft limit is trivial.
Using this technique, a bad actor could potentially cause a systemwide Oracle database communications failure, a shutdown, or a crash with only a few commands on a sideline server -- or even craft a bit of code to mimic a connecting instance. It may be possible, though more challenging, to cause the same problems though a SQL injection attack on a vulnerable application.
Our discovery of this attack method and our communication of this information to Oracle resulted in Oracle's request that InfoWorld hold this story until today, when a patch could be made available. In addition to the method we discovered, we learned of two other, similar means of incrementing the SCN by searching the open Web, although these required higher privilege levels. The patch blocks all three methods that we are aware of, though there may be others.
The realityThe SCN is a moving line that cannot be crossed. The line moves up by 16,384 every second; as long as the SCN growth rate is slower, all should be well.
But what happens when your SCN moves closer and closer to that line due to the spurious jumps caused by the backup bug, a simple mistake by an admin, or other means How do you deal with this impending problem
The answer: Shut down the database servers for a while so that the number stops incrementing.
Plainly put, this means that every single interlinked Oracle instance across an entire company will need to be shut down just to move a bit further away from the line. If only a handful are shut down, they will very quickly jump right back to the high SCN whenever they connect with other Oracle instances. The only way to ensure this problem is completely eradicated is to shut down every affected system: backup servers, replicas, everything.
Not only that, but while they're down, admins will need to scour the infrastructure to be absolutely certain that no affected Oracle systems have escaped remediation. If they miss even one instance, they will have to perform the complete shutdown again.
Then there's the issue of how long to shut down. If you shut down every instance for a week, that would buy you about 10 billion ticks away from the line of no return. How many businesses would entertain the idea of shutting down their database systems for that long
While they're down, the SCN could potentially be "reset" -- but only by dumping out each database, dropping the database, and importing the dump into a fresh database. This would have to be done for every database running on every database server across the entire organization, all at the same time. With databases routinely in the multiterabyte range, this will take a while.
Again, only very large customers with many interconnected Oracle databases would be likely to run a significant risk of being affected by this problem. But the larger the Oracle environment, the longer this restoration would take. Typically, large organizations have the least tolerance for downtime.
The fixUntil recently, aside from the backup bug fix, Oracle's only response to the SCN elevation issue -- as far as we've been able to determine -- has been to release a patch that extends the SCN calculation to 32,768 times the number of seconds since 01/01/1988, doubling the rate at which the soft limit increases. Oracle even made it modifiable, so admins can further increase the multiplier.
If this patch is applied to an Oracle instance, it will definitely increase the time the interlinked databases can run before hitting the SCN limit. However, it also introduces new variables.
Part of the problem is that you can't patch every system at once. Additionally, if you have a patched system with an elevated soft limit -- based on a multiplier of, say, 65,536 -- the SCN on that system could be higher than the SCN on an unpatched system using the original 16,384 multiplier, causing the unpatched system to refuse the connection or encounter another problem as it fails the soft limit check. There's also the issue of servers running older Oracle versions that may not have a patch available.
Furthermore, if this patch is a default inclusion in the next Oracle release, admins may suddenly discover that their existing servers are unable to communicate with new or upgraded servers that use the new, higher SCN calculation method, should the new servers have a sufficiently elevated SCN. If the SCN values line up just right, it's possible that a patched system could connect and set the SCN of an unpatched system just shy of the soft limit, causing the unpatched system to hit the limit through its own processing.
As mentioned, the risk of such a scenario playing out is very small except in large, highly interconnected environments where an elevated SCN can flow like a virus from server to server. But once a server is infected, there's no going back. Also, if the SCN is incremented arbitrarily -- or manually, with malicious intent -- then that 48-bit integer hard limit is suddenly not as astronomical as it might seem.
The community reactionInfoWorld contacted a number of Oracle sources for this story. Several lacked familiarity with the problem; others noted that Oracle licensing agreements prevented them from commenting on any aspect of their product usage. The head of the Independent Oracle User Group (IOUG), Andy Flower, offered this statement on the record: "This bug with the SCN number is obviously something our membership would be concerned about -- and will need to consider what sort of challenges that may present and if any mitigation strategies will be needed. I'm sure it will be a topic that some of our larger members will probably get together and discuss."
Among the Oracle experts we spoke with, Shirish Ojha, senior Oracle DBA for Logicworks, a hosting and private cloud service provider, was the most familiar with SCN issues, including the bug numbers associated with the problem. He acknowledges that although few Oracle environments are likely to encounter the problem, the consequences may be severe. "If there is a dramatic jump in SCN due to any Oracle bug, there is a minimalistic probability of breach of this seemingly high number," said Ojha, who has earned the coveted title of Oracle Certified Master. "If this occurs in a high-transaction and large interconnected Oracle architecture, this will render all interconnected Oracle databases useless in a short period of time."
Ojha continues: "If this occurs, even though its probability is low, the potential [financial] loss ... is very high." By definition, he said, the problem has the potential to affect only large Oracle customers. But "once the SCN limit is reached, there is no easy way to get out of the problem, other than shutting down all databases and rebuilding databases from scratch."
Anton Nielsen, the president of C2 Consulting and an Oracle expert, focused on the potential risk of malicious attack using an elevated SCN: "In theory, the elevated SCN attack is similar to a DoS attack in two significant ways: It can bring a system to its knees, rendering it inoperable for a significant period of time, and it can be accomplished by a user with limited permissions. While a DoS can be perpetrated by anyone with network access to a Web server, however, the elevated SCN requires a database username and password with the ability to connect."
The Oracle reactionWhen we first contacted Oracle about the SCN issue, Mark Townsend, vice president of database product management, offered this reaction to our discovery of a low-privilege method to arbitrarily increase the SCN: "The way that you're putting these [issues] together is nothing that we've seen ... we need to understand what it is that you're doing to raise the SCN by trillions. Obviously I need to have some time to have the dev people look at that. "
After much discussion and exchange of technical data, Oracle acknowledged that there were ways to increase the SCN at will. Referring to one method, Townsend said, "This is an undocumented, hidden parameter, so it was never intended for customers to discover and use this."
However, we pointed out that there were several other methods that could be used; we sent those to Oracle as well.
Oracle's remedy for these security vulnerabilities is in the series of patches present in the just-released January Oracle Critical Patch Update. These patches remove the various methods of arbitrarily increasing the SCN and implement a new method of protection, or "inoculation," as Townsend put it, for Oracle databases.
We haven't had time to exhaustively test these patches, nor do we know exactly what the "inoculation" patch does. In fact, without extensive testing, we cannot provide further details other than it claims to prevent connections from databases with sufficiently high SCN values. We do not know, for example, whether this could potentially cause problems for affected systems that need to connect with other systems.
These patches are being released for only the more recent versions of the database: Oracle 11g 11.1.0.7, 11.2.0.2, and 11.2.0.3, as well as Oracle 10g 10.1.0.5, 10.2.0.3, 10.2.0.4, and 10.2.0.5. Older versions will continue to be affected. Given the sheer number of Oracle installations older than 11.2.0.2.0 and 10.1.0.5, a large installed base will remain vulnerable.
The next stepsThe next step for Oracle admins is to inspect the SCN values of their databases. Following that, the application of the hot-backup patch is crucial, as are the follow-up patches that address the ability to arbitrarily increase the SCN value through administrative commands. However, since patches exist only for newer versions of the database, there may be no other option for older databases than to upgrade.
It's also critical that Oracle admins take great pains to prevent any unpatched Oracle database servers from connecting to any other Oracle databases within the infrastructure. This will present quite a challenge in large deployments that utilize many different Oracle versions, but it will be necessary to prevent spurious SCN growth. It appears that keeping SCN values in check will be an ongoing exercise for some Oracle shops, requiring monitoring and careful inspection of new installations down the road.
We hope that Oracle's patches and the increased visibility of this issue will provide Oracle shops with fair warning of problems they may face and arm them with at least some protection against a potentially large problem.
This article, "Fundamental Oracle flaw revealed," was originally published at InfoWorld.com. Follow the latest developments in business technology news and get a digest of the key stories each day in the InfoWorld Daily newsletter. For the latest business technology news, follow InfoWorld on Twitter.
Read more about security in InfoWorld's Security Channel.