Friday 27 May 2011

Last Sunday

Our production databases are backed up with one of two schedules.  Most days our backups are sent offsite and retained for 3 months before being returned.  But on the last Sunday of each month, our backups are archived for a longer period of time.  Therefore the RMAN backups and Veritas catalogs need to be retained for up to 7 years!

Us Linux DBAs love cron for scheduling jobs. These fancy software schedulers with their dependencies and clever logic have never really replaced a really dense, difficult to read crontab file. And that's just the way we like it in order to keep our jobs!

Now, crontab doesn't have a built-in facility for deciding whether a particular day is the 'last' or 'first' of the month (or quarter, or year).  Our current backup regime relies on a separate cron job to create a lock file, earlier in the day, then each individual backup script checks for existence of that file before deciding which tape schedule and retention policy to use.

I thought there was a slightly more elegant way of achieving this.  Here's what I came up with:

We have a simple script called islastsunday which compares the current date with the date of the last Sunday in the month (from the cal command).

#!/bin/bash
[ $(cal | awk '{print $1}' | tr -s '[:blank:]' '\n' | tail -1) = $(date +%d) ]
In our crontab file we have two entries for our backup, as follows:
# Check if day is LAST SUNDAY of Month. If NOT, run the regular backup
30 00 * * * islastsunday || (rman-job.sh level_0_tape ORCL > level_0_tape_ORCL.log 2>&1; rman-job.sh level_0_tape ORCL2 > level_0_tape_ORCL2.log 2>&1)
# Check if day is LAST SUNDAY of Month. If it IS, run the Monthly backup
30 00 * * * islastsunday && (rman-job.sh level_0_tape_LS ORCL > level_0_tape_LS_ORCL.log 2>&1; rman-job.sh level_0_tape_LS ORCL2 > level_0_tape_LS_ORCL2.log 2>&1)
Both schedules run at the same time, but only one of them will actually execute, depending upon the day of the month. The actual backup commands are bracketed, so that they are ALL dependent upon the islastsunday script, and will all run with the same retention policy, even if the backups continue to run after midnight (i.e. in Monday!).

Friday 20 May 2011

Cold Backup RMAN Fun and Games

Here's an RMAN oddity.  If anyone could explain it, I'd be grateful.

We were moving a database to a new server.  Same OS, same version of Oracle (10.2.0.4).  The database is in Archivelog mode, but to keep things easy, we decided to do an offline, consistent backup.

So, we shutdown the database, then mounted it and performed an RMAN backup to a flash_recovery_area on a shared NFS filesystem:

backup device type disk database plus archivelog include current controlfile;
The backup worked fine. So we logged on to the new server, made sure that we could see the NFS-mounted FRA, and issued the Duplicate command.

run {
allocate channel d1 type disk;
allocate channel d2 type disk;
allocate channel d3 type disk;
allocate channel d4 type disk;
allocate auxiliary channel da1 type disk;
allocate auxiliary channel da2 type disk;
allocate auxiliary channel da3 type disk;
allocate auxiliary channel da4 type disk;
duplicate target database to CATDB;
}
Oddly, it didn't work.  It returned an error saying that a backup could not be found.

Starting restore at 20-MAY-11

released channel: da1
released channel: da2
released channel: da3
released channel: da4
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of Duplicate Db command at 20/05/2011 15:25:38
RMAN-03015: error occurred in stored script Memory Script
RMAN-06026: some targets not found - aborting restore
RMAN-06100: no channel to restore a backup or copy of datafile 1
RMAN-06100: no channel to restore a backup or copy of datafile 2
etc...
If we tried using automatic channel allocations for the restore, it still did not find our disk backup, but tried restoring older tape backups.  Not what we wanted.

So, we tried backing it up again.  Backing up just the archivelogs.  Just the database.  Nada!  The same error every time we tried to Duplicate.

In desperation, I tried opening up our original DB and running an online backup.  Luckily the database was small and the backup quick.  I then shut it down and re-mounted to prevent the users from changing anything.

This time, upon trying the Duplicate command (this time with an "until sequence" inserted) it all worked as expected!

So it looks like a wee bug.  If you've got a database in Archivelog mode, but you perform an offline backup, RMAN seems to get confused and have trouble deciding which backups to restore.

We know it's not a general problem with offline backups, as we've duplicated DBs in Noarchivelog mode on this system before.