Monday, December 29, 2008

Don't pull that plug

It's never a good thing when any type of server suddenly loses power. I have been told it can be very bad when it happens to an i5/OS box. Up until recently we have been lucky. Our colocation facility has a bad track record of power outages, and they take down one of our biggest i5/OS servers. In the past it has come back up with no issues, but the most recent outage caused some headaches.

On the 23rd the power went out again. When the Domino partitions came back up we started to see some corruption issues. Since these servers are members of a cluster, we just recreated the replicas that were corrupt. Minor pain but no biggie. Then our system engineer started noticing issues with BRMS. Jobs were still showing as active that were not actually running. The problem then showed up on the Domino servers that were supposedly being backed up by the phantom BRMS job. One Domino server in particular started crashing and behaving strangely. When I tried a manual shutdown and restart, the server took a while to come down and then would not come back up.

IBM recommended an IPL to clear out the phantom BRMS job, so we ended all the Domino servers and IPL'd the system. This cleared the BRMS job, but the one Domino server still would not start. It ended up being two issues with this server.

1) The Directory= entry in the NOTES.INI was messed up. It had a bunch of symbol characters after the = sign. We have seen this on occasion with Domino on iSeries. The solution is to put the directory path info back in.

2) The transaction log files were corrupt. The Domino server would stall at the point where it was trying to read the log files when starting. I renamed the current trans log folder and created a new empty one. When Domino restarted it created new log files and ran a consistency check against all databases. Note: it actually took two restarts to get the log files built again. The first start ended with a panic telling me to restart the server again to finish the process. Thanks for the info.
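Since we have seen the garbled Directory= entry more than once, a quick sanity check before starting the server might save some head scratching. Here is a minimal sketch in Python, not anything IBM provides. The function names, the list of "expected" path characters, and the /domino/data path in the usage note are all my own assumptions for illustration.

```python
import string

# Assumption: a sane data directory path only uses these characters.
# Adjust the set for your own environment.
ALLOWED = set(string.ascii_letters + string.digits + "/\\:._- ")


def find_directory_entry(ini_text):
    """Return the value of the first Directory= line, or None if absent."""
    for line in ini_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip().lower() == "directory":
            return value.strip()
    return None


def directory_looks_valid(ini_text):
    """True if Directory= exists, is non-empty, and has no odd characters."""
    value = find_directory_entry(ini_text)
    if not value:
        return False
    return all(ch in ALLOWED for ch in value)
```

Feeding it the text of a healthy NOTES.INI (say, one containing the hypothetical line Directory=/domino/data) should return True, while a line like Directory=@#$%^ should return False. It only flags the problem. The fix is still to put the correct path back in by hand.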

This power outage wasn't as smooth as the previous ones, but at least Domino was able to repair itself without a full restore. It also helps that these are clustered servers, so any data that was in the corrupt trans log files will replicate back from the other server.

Now to find a new colocation provider. The funny thing is they were upgrading their UPSs when this latest outage happened.


Jim Casale said...

Your story reminds me of a challenge I gave my former manager when they wanted to look at Exchange 2007. I said put half the company owners on Exchange 2007 and the other half on Domino. I would stand behind the Domino server, he would stand behind the Exchange server. On a count of three, let's pull the plug and see who recovers faster. Needless to say he didn't think it was funny and he wouldn't try it.

Kevin Kanarski said...

Good one Jim. I'll have to use that scenario here.