HEMESH ORAWORLD: Troubleshooting 11.2 Clusterware Node Evictions

Friday 7 February 2014

Troubleshooting 11.2 Clusterware Node Evictions

Troubleshooting 11.2 Clusterware Node Evictions (Note 1050693.1)

Starting with 11.2.0.2, a node eviction may not actually reboot the machine. This is called a rebootless restart.

To identify which process initiated a reboot, review the following files:

Clusterware alert log in /log/alertnodename
The cssdagent log(s) in /log//agent/ohasd/oracssdagent_root
The cssdmonitor log(s) in /log//agent/ohasd/oracssdmonitor_root
The ocssd log(s) in /log//cssd
The lastgasp log(s) in /etc/oracle/lastgasp or /var/opt/oracle/lastgasp
IPD/OS or OS Watcher data. IPD/OS is the old name for the Cluster Health Monitor. The names can be used interchangeably, although Oracle now calls the tool Cluster Health Monitor.
'opatch lsinventory -detail' output for the GRID home
OS messages files, for example /var/log/messages on Linux
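
When working through the list above, it can help to see first which files were written around the time of the eviction. The following is a minimal sketch, not an Oracle tool: the function name `recent_files` and the 24-hour window are assumptions made for illustration, and only the two lastgasp directories come from the list above. Any other log directory you pass in is your own assumption about your installation.

```python
import os
import time

# The two lastgasp locations are taken from the list above; any other
# directory passed to recent_files() is the caller's assumption.
LASTGASP_DIRS = ["/etc/oracle/lastgasp", "/var/opt/oracle/lastgasp"]


def recent_files(directories, window_seconds, now=None):
    """Return (path, mtime) pairs for files modified within the window, newest first."""
    now = time.time() if now is None else now
    hits = []
    for directory in directories:
        if not os.path.isdir(directory):
            continue  # this location does not exist on this platform
        for name in os.listdir(directory):
            path = os.path.join(directory, name)
            if not os.path.isfile(path):
                continue
            mtime = os.path.getmtime(path)
            if 0 <= now - mtime <= window_seconds:
                hits.append((path, mtime))
    return sorted(hits, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical window: anything touched in the last 24 hours.
    for path, mtime in recent_files(LASTGASP_DIRS, window_seconds=24 * 3600):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(mtime))
        print(stamp, path)
```

Directories that do not exist are skipped silently, so the same call can be run unchanged on platforms that use either lastgasp location.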

Common Causes of eviction:

OCSSD Eviction:
1) Network failure or latency between nodes. By default, it takes 30 consecutive missed check-ins to cause a node eviction.
2) Problems writing to or reading from the voting disk.
3) A member kill escalation. For example, the LMON process may request CSS to remove an instance from the cluster via the instance eviction mechanism. If this times out, it can escalate to a node eviction.
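
The "30 consecutive missed check-ins" rule can be pictured with a small model. This is only a simplified sketch of the counting logic, not how CSS is implemented: the function name `first_eviction_tick` and the representation of check-ins as one boolean per heartbeat interval are assumptions made for illustration.

```python
def first_eviction_tick(checkins, threshold=30):
    """Return the index of the tick at which `threshold` consecutive
    check-ins have been missed, or None if that never happens.

    `checkins` is an iterable of booleans, one per heartbeat interval:
    True means the check-in arrived, False means it was missed.
    """
    missed_in_a_row = 0
    for tick, arrived in enumerate(checkins):
        if arrived:
            missed_in_a_row = 0  # any successful check-in resets the count
        else:
            missed_in_a_row += 1
            if missed_in_a_row >= threshold:
                return tick
    return None
```

In this model, 29 misses followed by a single successful check-in do not trigger an eviction, because the counter resets. That matches the word "consecutive" in the rule: short network hiccups are tolerated, while a sustained outage is not.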

CSSDAGENT or CSSDMONITOR Eviction:
1) An OS scheduler problem, for example because the OS is locked up or the server is under excessive load (such as CPU utilization as high as 100%).
2) The CSS process is hung.
3) An Oracle bug.
