What Is the OS Watcher Utility and How to Use It for Database Troubleshooting?

Oracle OS Watcher (OSWatcher) is a tool that helps remote DBAs troubleshoot database performance problems, cluster reboots, node evictions, DB server reboots, DB instance crashes, and many similar issues.

As we know, OS stats from tools like top, mpstat, and netstat play an important role in database troubleshooting, but these tools on their own keep no historical data. This is where OS Watcher comes to the DBA's rescue. Suppose there was a performance issue on a database node yesterday, but you were not aware of it, and by the time you found out, the issue had resolved itself.

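In that case, the DBA can get database-related stats from AWR reports, but not OS-related stats for the previous day. To overcome this challenge, Oracle introduced the OS Watcher utility, which collects OS stats at a frequency of five minutes and keeps them for seven days (default settings in this environment; when the tool is started manually without arguments, as in Example 2 below, the defaults are 30 seconds and 48 hours). So the DBA no longer needs to worry about historical OS stats.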

To troubleshoot database performance issues, AWR, ADDM, and OS Watcher logs are the first places a remote DBA should look, whereas for cluster reboots, node evictions, and DB server reboots, the alert log files, OS Watcher logs, and system messages (/var/log/messages) play an important role.

How to Install the OS Watcher Utility?

1. Download the tar file from the Oracle Support article "OSWatcher Black Box (Includes: [Video]) [ID 301137.1]".

2. Copy the file oswbb601.tar to the directory where oswbb is to be installed.
3. Extract the tar file as the “oracle” user.

$ tar xvf oswbb601.tar

4. Change to the oswbb directory that was created.

5. Start the OS Watcher utility using one of the commands below.

Example 1:

./startOSW.sh 60 10
This would start the tool and collect data at 60 second intervals and
log the last 10 hours of data to archive files.

Example 2:

./startOSW.sh
This would use the default values of 30, 48 and collect data at 30
second intervals and log the last 48 hours of data to archive files.

Example 3:

./startOSW.sh 20 24 gzip
This would start the tool and collect data at 20 second intervals and
log the last 24 hours of data to archive files. Each file would be
compressed by running the gzip utility after creation.

To stop the OSW utility execute the stopOSW.sh command. This terminates
all the processes associated with the tool.
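
Before choosing an interval and a retention period, it can help to estimate how many snapshots each collector will keep. The sketch below only does that arithmetic. The function name osw_snapshots is hypothetical and is not part of OSWatcher.

```shell
#!/bin/bash
# Hypothetical helper (not shipped with OSWatcher): estimate how many
# snapshots each collector retains for a given interval and retention.
# $1 = snapshot interval in seconds, $2 = hours of data to keep
osw_snapshots() {
    local interval_sec=$1 retain_hours=$2
    echo $(( retain_hours * 3600 / interval_sec ))
}

# Example: osw_snapshots 60 10    (the settings used in Example 1)
```

A shorter interval gives finer detail around an incident, but the number of snapshots, and therefore the disk space used, grows in proportion.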



The default location of the OS Watcher files is /opt/oracle.oswatcher/osw/archive. To collect the OS Watcher files for a particular day, use the commands below.

# cd /opt/oracle.oswatcher/osw/archive
# find . -name '*13.03.15*' -print -exec zip /tmp/osw_`hostname`.zip {} \;

(where 13 is the year, 03 is the month, and 15 is the day)
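
The same collection step can be wrapped in a small function so that the date pattern does not have to be typed by hand. This is a minimal sketch that assumes GNU date (as found on Linux) and the yy.mm.dd file naming described above. The function names osw_date_pattern and osw_collect_day are hypothetical.

```shell
#!/bin/bash
# Build the yy.mm.dd pattern that appears in OS Watcher file names.
# $1 = any date string GNU date understands, e.g. "yesterday" or "2013-03-15"
osw_date_pattern() {
    date -d "$1" +%y.%m.%d
}

# Zip every archive file for the given day into one file under /tmp.
# $1 = archive directory, $2 = date string
osw_collect_day() {
    local archive_dir=$1 pattern
    pattern=$(osw_date_pattern "$2") || return 1
    find "$archive_dir" -name "*${pattern}*" \
        -exec zip "/tmp/osw_$(hostname)_${pattern}.zip" {} +
}

# Example: osw_collect_day /opt/oracle.oswatcher/osw/archive yesterday
```

Putting the date in the zip file name keeps collections for different days from being appended to the same archive.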

Below is the list of subfolders created under the archive folder:

-bash-4.1$ ls

osw_ib_diagnostics   oswiostat            oswnetstat           oswps                oswvmstat
osw_rds_diagnostics  oswmpstat            oswprvtnet           oswtop

Troubleshooting using OS Watcher

Recently, we faced a node eviction issue in a three-node Real Application Clusters environment. To resolve it, we started by looking at the alert log files for the DB and RAC environment and at the OS Watcher logs. The mpstat values from OS Watcher at the time of the issue are given below.

                 CPU   %user   %nice  %sys %iowait    %irq   %soft  %steal   %idle    intr/s
16:27:00     all    2.60    0.00    1.64   46.53    0.01    0.06    0.00   49.16   3088.40
16:27:05     all    0.44    0.00    1.50   17.50    0.01    0.05    0.00   80.50   2397.39
16:27:10     all    0.47    0.00    0.62   12.98    0.02    0.03    0.00   85.88   2361.48
16:27:15     all    1.00    0.00   14.08    5.34    0.01    0.04    0.00   79.52   2097.98
16:27:21     all    1.11    0.00   72.81   25.22    0.02    0.23    0.00    0.61   6164.79
16:27:28     all    0.73    0.00   98.59    0.56    0.02    0.10    0.00    0.00   5348.05
16:27:33     all    0.39    0.00   99.44    0.11    0.02    0.04    0.00    0.00   3578.19
16:30:02     all    0.16    0.00   96.24    2.63    0.00    0.03    0.00    0.93   1688.58
16:30:07     all    1.34    0.00    1.79   60.13    0.02    0.09    0.00   36.62   5086.03
16:30:12     all    0.99    0.00    0.49   80.87    0.03    0.07    0.00   17.56   3650.30

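From the above data, it is clear that all CPUs were close to 100% utilized: %sys climbed from about 14% at 16:27:15 to over 98% by 16:27:28 while %idle dropped to 0, and the samples shown jump from 16:27:33 to 16:30:02. This starved the system of resources, and the node was rebooted. Next, the DBA needs to find out what caused this high system utilization.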
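
On a busy day the mpstat archive holds thousands of samples, so it is easier to filter for the saturated ones than to read the file by eye. The sketch below assumes the column layout shown above (a 24-hour timestamp, then CPU, %user, %nice, %sys, %iowait, %irq, %soft, %steal, %idle, intr/s). Other mpstat versions may shift the columns. The function name osw_busy_samples is hypothetical.

```shell
#!/bin/bash
# Print the mpstat samples whose %idle is at or below a threshold.
# Assumed layout: $1=time $2=CPU $5=%sys $6=%iowait $10=%idle
# $1 = mpstat archive file, $2 = idle threshold in percent (default 5)
osw_busy_samples() {
    awk -v max_idle="${2:-5}" '
        $2 == "all" && $10 + 0 <= max_idle {
            printf "%s sys=%s iowait=%s idle=%s\n", $1, $5, $6, $10
        }' "$1"
}

# Example: osw_busy_samples <mpstat_file> 5
```

Run against the samples above with a threshold of 5, this would keep only the rows from 16:27:21 to 16:30:02, which is the window to check in the other OS Watcher folders.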

To troubleshoot this, I checked the top output at the time of the issue from the OS Watcher logs in the oswtop folder. Around 200 parallel processes were running at that time. I then cross-checked these processes against the top output from a period when the system was healthy, and it was clear that they were not running then. In conclusion, the high number of parallel processes caused this issue.
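
The comparison between the problem period and a healthy period can be scripted by counting parallel processes per top snapshot. This is a sketch under three assumptions: the parallel processes are Oracle parallel execution servers named ora_pNNN_<SID>, each snapshot in the top file starts with a timestamp line beginning with "zzz", and the command name is the last column of each process line. The function name osw_parallel_counts is hypothetical.

```shell
#!/bin/bash
# Count Oracle parallel execution server processes in each top snapshot.
# Assumes every snapshot begins with a "zzz" timestamp line and that the
# command name (e.g. ora_p000_<SID>) is the last field of a process line.
# $1 = top archive file
osw_parallel_counts() {
    awk '
        /^zzz/ {
            if (stamp != "") print stamp ": " n
            stamp = $0; n = 0; next
        }
        $NF ~ /^ora_p[0-9]/ { n++ }
        END { if (stamp != "") print stamp ": " n }
    ' "$1"
}

# Example: osw_parallel_counts <top_file_for_problem_hour>
```

Running it on the file for the problem hour and on a file from a healthy hour gives two lists of counts that can be compared directly.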

So, with the help of the Oracle OS Watcher tool, both the problem and its cause became clear. This is a simple real-life scenario that shows how OS Watcher can help a remote DBA resolve issues. I have also mentioned the steps for a detailed analysis of OS Watcher logs.

Please share your views about this article.