Wednesday, November 10, 2010

How to clear NFS locks on Oracle datafiles after a network crash or outage.

Symptoms :

The database cannot be opened because stale locks still exist on the data files stored on the filer.

Error : ORA-27086 (unable to lock file - already in use)

Cause of this problem :

Data files that were open during the network outage are left in an NFS-locked state, and Oracle cannot open the locked files.

The lock recovery manager (NLM) in the Linux kernel uses uname -n to determine the host name, while the rpc.statd process (NSM) uses gethostbyname() to determine the client's name. If these do not match, the recovery process will not work.

Solution :

The recovery steps detailed below should be taken if Oracle is hung, but PROPER ORACLE TROUBLESHOOTING AND SUPPORT METHODS SHOULD STILL BE FOLLOWED FOR ANY DATABASE-HUNG ISSUES, independently of NetApp.

Oracle's database product does not typically hang after a network crash.

Summary of Corrective steps :

1) Shut down Oracle databases

2) Unmount database volumes

3) Kill lockd/statd processes on UNIX host

4) Clear locks on filer

5) Remove the NFS lock files on the host.

6) Restart lockd/statd processes on UNIX host

7) Remount the database volumes on the UNIX host

8) Restart databases

Detailed Procedure:

1) Shut down all Oracle databases being run by the affected server.

Issue the Oracle shutdown immediate command, then verify that no database processes are still running by issuing the UNIX command ps -ef | grep -i ora on the UNIX database host.

If database processes are still running, issue the Oracle shutdown abort command and run ps -ef | grep -i ora again to verify that none remain.

If database processes are still running after that, do the following from the UNIX command line:
ps -ef | grep -i ora to get the process IDs (PIDs) of the remaining Oracle processes

kill -9 pid for each remaining Oracle process.
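
The check in step 1 can be sketched as a small shell helper. This is only an illustration, not part of the original procedure: the function name remaining_oracle_pids is hypothetical, and it assumes ps -ef prints the PID in the second column. It uses the same loose, case-insensitive "ora" match as the grep above, so it can also catch unrelated processes; review the list before killing anything.

```shell
# Hypothetical helper: read `ps -ef` output on stdin and print the PID
# (second column) of every line containing "ora", case-insensitively.
# The [o] bracket keeps the filter from matching its own command line.
remaining_oracle_pids() {
    awk 'tolower($0) ~ /[o]ra/ { print $2 }'
}

# Usage (list first, kill only after reviewing the list):
#   ps -ef | remaining_oracle_pids
#   ps -ef | remaining_oracle_pids | while read -r pid; do kill -9 "$pid"; done
```

An empty result from the first usage line means no matching processes remain and the kill step can be skipped.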

2) Unmount all database volumes using the UNIX umount command.
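
To find what needs unmounting in step 2, one option on a Linux host is to list the NFS mount points from /proc/mounts. The sketch below is an assumption layered on the procedure, not part of it: nfs_mount_points is a hypothetical helper, and it lists every NFS mount, not only the database volumes, so pick the relevant ones from its output.

```shell
# Hypothetical helper: read /proc/mounts-style lines on stdin
# (device, mount point, filesystem type, options, ...) and print the
# mount point of every NFS filesystem.
nfs_mount_points() {
    awk '$3 == "nfs" || $3 == "nfs4" { print $2 }'
}

# Usage on a Linux host (assumes /proc/mounts is available):
#   nfs_mount_points < /proc/mounts
# Then run umount on each database volume in the list.
```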

3) Kill the lockd and statd processes on the UNIX host in the order specified below:
Determine the process IDs (PIDs) of lockd and statd from the UNIX command line:

ps -ef |grep lockd

ps -ef |grep statd

kill [lockd_process_id]

kill [statd_process_id]

4) Remove locks from filer

Execute the following from the filer command line:

filer> priv set advanced

filer> sm_mon -l (In many cases specifying the host name does not clear all the affecting locks, so the recommendation is to NOT specify a hostname)

Delete all files in the filer's "/etc/sm" directory. (Remove the files only. Do NOT remove the "/etc/sm" directory itself.)

If the filer is running Data ONTAP 7.1 or higher, run 'lock break -h [hostname]' to release any locks that still exist.

Note:
If 'lock break -h [hostname]' doesn't work, ensure that the server name you are entering is the same as the one the filer has recorded for the client.

If the locks are still not cleared, run 'lock break -p nlm' (this also requires Data ONTAP 7.1 or higher). This clears all the NFS locks on the filer. It does not sever any NFS connections; it simply forces the processes to re-request the locks for the files they are writing to.

5) Remove the NFS lock files on the host.

From TR-3183, Using the Linux NFS Client with Network Appliance Storage:
rpc.statd uses gethostbyname() to determine the client's name, but lockd (in the Linux kernel) uses uname -n.
Setting HOSTNAME= to the fully qualified domain name makes lockd use an FQDN when contacting the storage. If lnx_node1.iop.eng.netapp.com and lnx_node5.ppe.iop.eng.netapp.com both contact the same NetApp storage, the storage can then correctly distinguish the locks owned by each client. Therefore, we recommend using the fully qualified name in /etc/sysconfig/network. In addition, running sm_mon -l or lock break on the storage clears the locks held there, which fixes the lock recovery problem.

Additionally, if the client's nodename is fully qualified (that is, it contains the hostname and the domain name spelled out), then rpc.statd must also use a fully qualified name. Likewise, if the nodename is unqualified, then rpc.statd must use an unqualified name. If the two values do not match, lock recovery will not work. Be sure the result of gethostbyname(3) matches the output of uname -n by adjusting your client's nodename in /etc/hosts, DNS, or your NIS databases.
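
The name comparison described above can be checked from the client with a short script. This is a rough sketch under stated assumptions: the function names are hypothetical, and getent hosts is used only as an approximation of what gethostbyname() returns (the canonical name is taken from the second field of its output). It is not what rpc.statd itself runs.

```shell
# Succeeds only when both names are non-empty and identical.
# $1: nodename as reported by uname -n
# $2: canonical name returned by the resolver
names_match() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Approximation of gethostbyname(): the canonical name is the second
# field of the first line printed by `getent hosts`.
canonical_name() {
    getent hosts "$1" | awk 'NR == 1 { print $2 }'
}

# Compare the two names on this host and report the result.
check_lock_names() {
    node=$(uname -n)
    canon=$(canonical_name "$node")
    if names_match "$node" "$canon"; then
        echo "OK: nodename and resolved name are both '$node'"
    else
        echo "MISMATCH: uname -n='$node' resolver='$canon'"
        return 1
    fi
}

# Usage: check_lock_names
```

A mismatch here, for example a short nodename against a fully qualified resolver name, is the condition under which lock recovery is expected to fail.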

6) Start the UNIX statd and lockd processes from the UNIX host command line in the order specified below:

/usr/lib/nfs/statd

/usr/lib/nfs/lockd
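
After restarting the daemons, it may be worth confirming that both have registered with the portmapper before remounting. The sketch below is an addition, not part of the original procedure: lock_services_registered is a hypothetical helper that scans rpcinfo -p output for RPC program 100024 (status, served by statd) and 100021 (nlockmgr, served by lockd).

```shell
# Hypothetical helper: read `rpcinfo -p` output on stdin and succeed
# only if both the status (100024) and nlockmgr (100021) RPC programs
# are listed.
lock_services_registered() {
    awk '$1 == 100024 { statd = 1 }
         $1 == 100021 { lockd = 1 }
         END { exit !(statd && lockd) }'
}

# Usage:
#   rpcinfo -p | lock_services_registered && echo "statd and lockd are registered"
```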

7) Mount the database volumes on the UNIX host.

8) Start the database(s) and test for availability.
