Life Saver Action Plan when Exadata Production is Down


Switch to Data Guard if you have one. If you don't, reboot everything. Yes, you got it right: reboot all machines one by one.

In Which Order?


First all Database Nodes, then all Cell Servers one by one, and the IB Switches last.
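The order above can be sketched as a dry-run script. The host names are assumptions for a Quarter Rack; on a real system you would run the printed commands (over ssh, as root) instead of just echoing them.

```shell
# Dry-run sketch of the reboot order; host names are assumptions.
# On a real system, run the printed commands instead of echoing them.
print_reboot_plan() {
  # 1. Database nodes first: stop clusterware cleanly, then reboot.
  for node in exadata01 exadata02; do
    echo "ssh root@$node 'crsctl stop crs -f; reboot'"
  done
  # 2. Cell servers next, one by one: stop cell services, then reboot.
  for cell in exadatacell01 exadatacell02 exadatacell03; do
    echo "ssh root@$cell 'cellcli -e alter cell shutdown services all; reboot'"
  done
}
print_reboot_plan
```

Stopping cell services with `cellcli -e alter cell shutdown services all` before the reboot lets the cell go down gracefully rather than being cut off mid-I/O.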

What is the Impact?


Since everything is already down, there is no additional impact.

Let's see this in detail. Exadata is an expensive machine, mostly used for production environments. There can be cases when one DB node goes down or one Cell Server is not reachable. An Exadata DBA can manage this because we have redundancy at every level: it will have some impact on performance, but the environment stays up and running. Most of the time the issue is with the DB nodes, especially with the cluster. You can reboot the node or restart CRS and see if this fixes the issue.
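The "restart CRS" option mentioned above boils down to the commands below, printed here as a dry run. The Grid Infrastructure home path is an assumption and varies by install.

```shell
# Dry-run: clusterware restart on one DB node (run the printed commands as root).
# GRID_HOME below is an assumption; adjust it to your install.
print_crs_restart() {
  GRID_HOME=/u01/app/11.2.0/grid
  for cmd in "crsctl stop crs -f" "crsctl start crs" "crsctl check crs"; do
    echo "$GRID_HOME/bin/$cmd"
  done
}
print_crs_restart
```

The final `crsctl check crs` verifies that all clusterware daemons came back after the restart.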

Here, I am talking about the scenario when production is down. Suppose you have an Oracle Exadata Database Machine Quarter Rack. An X5-2 Quarter Rack has 2 Database Nodes, 3 Cell Servers, and 2 IB Switches. So, what is a production down scenario?

1. Both Database Nodes are down.
2. More than one Cell Server is down, disk groups are dismounted.
3. Both IB Switches are down, the RAC interconnect is not working, and everything is down.

When you are stuck in this situation, you want to fix the problem first and do Root Cause Analysis later. Before reading ahead, I would suggest checking whether the private interconnect is working. The reason the interconnect comes first is that in most cases the interconnect causes the complete outage. You should check DB node to DB node and DB node to Cell Server connectivity.

How do you find the interconnect IP addresses for the Database Nodes and check the interconnect?


[root@exadata01 ~]# cat /u01/app/grid/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log | grep interconnect

cluster interconnect IPC version: Oracle RDS/IP (generic)

Private Interface 'ib0:1' configured from GPnP for use as a private interconnect.

  [name='ib0:1', type=1, ip=169.254.37.173, mac=80-01-68-cc-fe-80-00-00-00-00-00-00-00-10-00-00-00-00-00-00, net=169.254.0.0/17, mask=255.255.128.0, use=haip:cluster_interconnect/62]

Private Interface 'ib1:1' configured from GPnP for use as a private interconnect.

  [name='ib1:1', type=1, ip=169.254.186.7, mac=80-01-68-cd-fe-80-00-00-00-00-00-00-00-10-00-00-00-00-00-00, net=169.254.128.0/17, mask=255.255.128.0, use=haip:cluster_interconnect/62]

cluster interconnect IPC version: Oracle RDS/IP (generic)

  cluster_interconnects    = "192.168.10.11:192.168.10.12"

cluster interconnect IPC version: Oracle RDS/IP (generic)

  cluster_interconnects    = "192.168.10.11:192.168.10.12"
 

This machine is using 192.168.10.11 and 192.168.10.12 as private interconnect addresses. Since this is Exadata, InfiniBand is used as the interconnect. Use rds-ping instead of ping to check that this machine is reachable from the other node and vice versa. On normal RAC systems you can use the ping command; finding the interconnect IPs works the same way.
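To avoid missing an address, you can loop over the interconnect IPs found above. rds-ping only exists on RDS-enabled systems, so this sketch just prints the checks to run; the `-c` count flag is an assumption about your rds-tools build, and without it you stop rds-ping with Ctrl-C.

```shell
# Dry-run: print the rds-ping checks to run from each DB node.
# The IPs are the interconnect addresses found in the ASM alert log above.
print_ping_checks() {
  for ip in 192.168.10.11 192.168.10.12; do
    echo "rds-ping -c 3 $ip"
  done
}
print_ping_checks
```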

[root@exadata02 ~]# rds-ping 192.168.10.11

   1: 88 usec   

   2: 102 usec

   3: 94 usec

   4: 89 usec

   5: 105 usec

[root@exadata02 ~]# rds-ping 192.168.10.12

   1: 102 usec

   2: 71 usec

   3: 77 usec

[root@exadata01 ~]# rds-ping 192.168.10.42

   1: 85 usec

   2: 80 usec

   3: 73 usec

   4: 88 usec

   5: 79 usec

   6: 81 usec

[root@exadata01 ~]# rds-ping 192.168.10.13

   1: 101 usec

   2: 141 usec

   3: 126 usec

   4: 72 usec

Now, let’s check the CELL interconnects. If the Cell Servers are not reachable from the DB nodes, you can't access the storage that holds the voting disks and OCR, and in that case CRS can't start on the DB nodes.

[root@exadatacell01 ~]# cat /var/log/oracle/diag/asm/cell/exadatacell01/trace/alert.log | grep -5 'CELL communication'

Successfully allocated 4864 MB for Storage Index. Storage Index memory usage can grow up to a maximum of 9339 MB.

CELLSRV configuration parameters:

Memory reserved for cellsrv: 93398 MB Memory for other processes: 2900 MB

_cell_auto_dump_errstack=FALSE (default=TRUE)

_cell_fc_persistence_state=WriteBack

Successfully allocated 4864 MB for Storage Index. Storage Index memory usage can grow up to a maximum of 9339 MB.

CELL communication is configured to use 2 interface(s):

    192.168.10.5

    192.168.10.6

IPC version: Oracle RDS/IP (generic)

IPC Vendor 1 Protocol 3

  Version 4.1


Get the other cells' private interconnect IP addresses as well, then check whether the private interconnect is working. This cell can ping the other Cell Servers and the DB nodes.
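On the cell itself you can also list the private IPs with CellCLI: `cellcli -e list cell detail` reports them as the ipaddress1/ipaddress2 attributes. The excerpt below is a made-up sample of that output, grepped the same way you would on a real cell.

```shell
# Hypothetical excerpt of 'cellcli -e list cell detail' output (values made up).
# On a real cell, run:  cellcli -e list cell detail | grep -i ipaddress
sample_cell_detail() {
cat <<'EOF'
         name:                   exadatacell02
         ipaddress1:             192.168.10.7/22
         ipaddress2:             192.168.10.8/22
EOF
}
sample_cell_detail | grep -i ipaddress
```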

[root@exadatacell02 ~]# rds-ping 192.168.10.5

   1: 53 usec

   2: 41 usec

   3: 37 usec

   4: 39 usec

[root@exadatacell02 ~]# rds-ping 192.168.10.6

   1: 40 usec

   2: 52 usec

   3: 35 usec

Check that the cell can ping the DB nodes as well.

[root@exadatacell02 ~]# rds-ping 192.168.10.13

   1: 45 usec

   2: 52 usec

   3: 102 usec

We don't see a private interconnect issue, so we can skip the IB switch reboot. Here, the action plan is:
First reboot the Database Nodes, then all Cell Servers one by one.

If you want to investigate further:

First, check the private interconnect, which we have already done.
Second, check the CRS alert logs, OS messages files, and ASM alert logs on the Database Nodes.
Third, check the Cell Server alert logs and messages files on the Cell Servers.
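As a checklist, those logs map roughly to the locations below. The exact paths are assumptions that vary with Grid Infrastructure and cell software versions; the ASM and cell alert log paths match the ones used earlier in this post.

```shell
# Log-location checklist sketch; paths are typical examples, not exact for every release.
print_log_checklist() {
  echo 'DB node : $GRID_HOME/log/<host>/alert<host>.log                 (CRS alert log)'
  echo 'DB node : /var/log/messages                                     (OS messages file)'
  echo 'DB node : /u01/app/grid/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log (ASM alert log)'
  echo 'Cell    : /var/log/oracle/diag/asm/cell/<cell>/trace/alert.log  (cell alert log)'
  echo 'Cell    : /var/log/messages                                     (OS messages file)'
}
print_log_checklist
```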

I have noticed many cases where a Cell Server reports RS-7445 or RS-600 errors in its alert log, which blocks the whole storage so the ASM or DB instances don't start. There can be many reasons for production being down. This post is about the action plan when production is down; I will discuss how to investigate in coming posts.

You have followed the action plan but production is still down? Don't wait: open a Service Request with Oracle Support. There can be a few more cases where production goes down. Don't forget to share your experience and feedback about this post.
