Saturday, January 25, 2014

crsd.bin core dumps

Core dump issues can be notoriously difficult to troubleshoot. I got a call this morning from one of my customers saying that after a power outage Grid Infrastructure was not able to fully come up on some nodes of their Exadata cluster. On closer examination it turned out that the crsd.bin binary was simply core dumping on startup.

Troubleshooting Grid Infrastructure startup issues can be a chore even when nothing is core dumping, so what could be more fun than a cluster that can't fully start because a major daemon is core dumping?

One of the useful things to do when a binary core dumps is to get a stack trace and see which function raised the exception (you can examine the core file with gdb, for example). Let's see what the stack trace holds for us:
Core was generated by `/u01/app/ reboot'.
Program terminated with signal 6, Aborted.
#0  0x0000003ea3e30285 in raise () from /lib64/
(gdb) bt
#0  0x0000003ea3e30285 in raise () from /lib64/
#1  0x0000003ea3e31d30 in abort () from /lib64/
#2  0x0000003ea56bed94 in __gnu_cxx::__verbose_terminate_handler() ()
   from /usr/lib64/
#3  0x0000003ea56bce46 in ?? () from /usr/lib64/
#4  0x0000003ea56bce73 in std::terminate() () from /usr/lib64/
#5  0x0000003ea56bcef9 in __cxa_rethrow () from /usr/lib64/
#6  0x0000000000df8672 in Acl::Acl (this=0x4556d440, domain=..., resource=...,
    aclString=..., useOcr=true, $U7=,
    $U8=, $U9=,
    $V0=, $V1=) at acl.cpp:120
#7  0x0000000000df879c in Acl::_ZN3CAA3AclC1ERKSsS2_S2_b (this=0x4556d440,
    $U7=, $U8=,
    $U9=, $V0=,
#8  0x0000000000a4d81e in SrvResource::initUserId (this=0x7f15803d7550,
    $1=) at clsAgfwSrvResource.cpp:204
We can see that the exception comes out of Acl::Acl, where it is rethrown, and then propagates through the standard libraries until std::terminate() aborts the process. Moreover, the function SrvResource::initUserId appears in the stack trace as well, which makes you wonder whether there is an issue with some resource's Access Control List, in particular with its user id setting.
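When a backtrace is longer than this one, it can help to filter it down to the frames that carry a source location, since those belong to the application rather than to libc or libstdc++. Below is a minimal sketch of that idea in Python; it is not part of any Oracle or gdb tooling, the function name application_frames is made up for illustration, and it assumes the backtrace text looks like the gdb output above (frames starting with "#N" and application frames ending in "at file:line").

```python
import re

# Start of a gdb backtrace frame, e.g.
# "#6  0x0000000000df8672 in Acl::Acl (this=..."
FRAME_START = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(\S+)")
# Trailing source location at the end of a frame, e.g. "at acl.cpp:120"
SOURCE_LOC = re.compile(r"\bat\s+(\S+):(\d+)\s*$")


def application_frames(backtrace):
    """Return (frame_no, function, file, line) for frames with a source location.

    Hypothetical helper: frames that only say "from /some/lib" are skipped.
    """
    frames = []
    current = None  # [frame number, function name, accumulated frame text]
    for line in backtrace.splitlines():
        match = FRAME_START.match(line)
        if match:
            if current:
                frames.append(current)
            current = [int(match.group(1)), match.group(2), line]
        elif current:
            # Continuation line of a frame that gdb wrapped
            current[2] += " " + line.strip()
    if current:
        frames.append(current)

    result = []
    for number, function, text in frames:
        location = SOURCE_LOC.search(text)
        if location:
            result.append((number, function, location.group(1), int(location.group(2))))
    return result


if __name__ == "__main__":
    sample = """\
#5  0x0000003ea56bcef9 in __cxa_rethrow () from /usr/lib64/
#6  0x0000000000df8672 in Acl::Acl (this=0x4556d440, domain=..., resource=...,
    aclString=..., useOcr=true, $U7=,
    $U8=, $U9=,
    $V0=, $V1=) at acl.cpp:120
#8  0x0000000000a4d81e in SrvResource::initUserId (this=0x7f15803d7550,
    $1=) at clsAgfwSrvResource.cpp:204
"""
    for frame in application_frames(sample):
        print(frame)
```

Run against the trace above, this keeps only the Acl::Acl and SrvResource::initUserId frames, which are exactly the two that point at the ACL and user id problem.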

Armed with that knowledge you can now sift through the Grid Infrastructure logs much more effectively, which matters because these logs are notoriously big and "chatty" (I think my worst nightmare is the database alert log becoming like the GI alert log and thereby much less useful). And there we have it:
Exception: ACL entry creation failed for: owner:ggate:rwx
It turned out that the nodes which were core dumping had recently been added to the cluster, and the user ggate, which is the owner of the GoldenGate resource, simply did not exist on these nodes. Apparently that was enough to cause the crsd.bin core dumps. Yikes!
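A check like this is easy to script for newly added nodes. The following is a rough sketch, not an Oracle utility: it pulls owner names out of log lines shaped like the one quoted above and asks the local passwd database whether each user exists. The "owner:<user>:<perms>" pattern is inferred from that single log line, and the function names are invented for this example.

```python
import pwd  # Unix-only standard library module
import re

# Assumed format, inferred from the one log line in this post:
# "ACL entry creation failed for: owner:<user>:<perms>"
ACL_FAILURE = re.compile(r"ACL entry creation failed for: owner:([^:\s]+):")


def owners_from_log(log_text):
    """Collect distinct owner names from ACL failure lines, in order of appearance."""
    seen = []
    for match in ACL_FAILURE.finditer(log_text):
        if match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


def os_user_exists(name):
    """Return True if the local user database knows this user name."""
    try:
        pwd.getpwnam(name)
        return True
    except KeyError:
        return False


def missing_owners(log_text, user_exists=os_user_exists):
    """Return owners mentioned in ACL failures that do not exist on this node."""
    return [user for user in owners_from_log(log_text) if not user_exists(user)]


if __name__ == "__main__":
    line = "Exception: ACL entry creation failed for: owner:ggate:rwx"
    print(missing_owners(line))
```

The user_exists parameter is there only so the lookup can be swapped out, for instance to test the parsing on a machine that is not a cluster node.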
