December 19th, 2002, 04:59 PM
help - DNS mystery
I had something rather interesting happen to both of my internal DNS boxes last night at almost exactly the same time. Both of these boxes are Solaris with the same build and patches. A little after 4 AM PST I got the following in the logs:
Dec 19 04:08:24 nms2 named: [ID 866145 daemon.crit] message.c:809: REQUIRE(*rdataset == 0) failed
Dec 19 04:08:24 nms2 named: [ID 866145 daemon.crit] exiting (due to assertion failure)
The other box was time stamped 5 seconds later with the same messages. Now I'm not a huge DNS person (that's usually our network folks, but since I was the only one in this morning I got to "fix" it), so I don't really know what this all means. What is the assertion failure, and why did it affect both boxes at the same time? I looked through my IDS logs and firewall logs for anything around that time, and nothing seemed to correspond with those time stamps, so I'm not sure if someone "attacked" them or if they just both failed for the same reason at the same time. Neither of these boxes has a connection to the outside world; they are only our internal DNS servers, and we have external DNS servers for outside communication.
I've determined that the named daemon stopped on both servers and all I had to do was restart it, but that doesn't tell me WHY it happened in the first place.
Any ideas as to what could have been up with these machines?
December 20th, 2002, 12:24 AM
No clue what the issue is, or why they both quit at the same time, but the general consensus on Google Groups is that you need to upgrade BIND.
December 20th, 2002, 04:22 PM
Heh, thanks for the response. As a matter of fact I just checked, and both of those boxes are running BIND 9.1.3, so I guess today's project is to upgrade them to 9.2.1 and see if that fixes them.
They both went down at around midnight PST last night and nothing seems to have triggered it. I have a couple of different boxes monitoring traffic to and from those boxes and doing content control and I saw nothing out of the ordinary.
I'll upgrade, keep an eye on them and post again if it was successful or not.
December 20th, 2002, 08:13 PM
hmmm, well after looking at those two boxes and checking their BIND level I see that they are already at 9.2.1, so I'm not real sure what's going on. That link you provided said that we should pretty much go to 9.2.1 to "fix" the problem, but since I'm already at that level and still have that problem I'm not real sure where to go to from here. :/
December 23rd, 2002, 04:03 PM
well late last week I made a script that looks to see if the named daemon is running or not. If it isn't running it starts it back up and then sends me an email alert about it.
I have had 32 emails over the weekend saying that this thing has stopped and restarted. I hadn't had ANY DNS problems in 7 months until last week. Perhaps a rollback to a previous 9.x BIND would help?
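For anyone following along, a watchdog like the one described above could look roughly like this. It is only a sketch, not the poster's actual script: the function names, the start command, and the mail address are all made up for illustration, and it assumes pgrep is available (it ships with recent Solaris releases).

```shell
#!/bin/sh
# Hypothetical watchdog sketch. Nothing here comes from the original
# script; the names, paths and address are placeholders.

# is_running NAME: succeed if a process with exactly this name exists.
is_running() {
    pgrep -x "$1" >/dev/null 2>&1
}

# check_and_restart NAME START_CMD NOTIFY_CMD
# If NAME is not running, run START_CMD and pipe a one-line message
# into NOTIFY_CMD. Does nothing when NAME is already running.
check_and_restart() {
    name=$1
    start_cmd=$2
    notify_cmd=$3
    if is_running "$name"; then
        return 0
    fi
    msg="$(date): $name was not running on $(uname -n), restarting"
    $start_cmd
    echo "$msg" | $notify_cmd
    return 0
}

# Example call from cron (assumed paths and address):
# check_and_restart named /usr/sbin/named "mailx -s named-restart admin@example.com"
```

Run from cron every minute or so, each email then carries a timestamp that can be lined up against the syslog entries and any packet captures.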
December 23rd, 2002, 07:07 PM
You might want to try running 'truss' on the named process, with the output going to a file on a partition that can hold A LOT of data, until it crashes (if you know it only happens after a few hours, try to wait a little while before starting, to minimize data). You may at least be able to see what it was trying to do when it crashed. If I lost you on that, let me know and I can give a step by step.
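Attaching truss to the running daemon might look something like the sketch below. The -f (follow children), -o (output file) and -p (attach to a pid) options are standard truss options; the function name, the output path, and the optional third argument are just conventions invented for this example.

```shell
#!/bin/sh
# Sketch only: trace_pid and the paths are placeholders for illustration.

# trace_pid PID OUTFILE [RUNNER]
# Attach truss to PID, follow any children, and write the trace to OUTFILE.
# Pass "echo" as RUNNER to print the command instead of running it.
trace_pid() {
    ${3:-} truss -f -o "$2" -p "$1"
}

# Example (assumes a single named process and a large /bigpartition):
# trace_pid "$(pgrep -x named)" /bigpartition/truss.named.out
```

When named dies, truss exits by itself, and the last few lines of the output file show the final system calls and the signal that killed the process.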
Secondly, I would be interested to know if you have something like snort lying around that could help monitor network traffic to see if you are being attacked (a buffer overflow attack that fails will sometimes leave a crashed daemon behind). If not, you could maybe set up a verbose snoop to see the content of the packets (same as above, output to a file on a big partition).
Something like 'snoop -d <interface> -V -x 0,65535 -r port 53 and udp > /bigpartition/snoop' should work nicely, at least enough to keep an eye on traffic.
Have your script note when your named process dies, then kill off the above and take a look.
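One way to wire that together is sketched below: poll for the daemon, and once it is gone, print the time and stop the capture. This is an assumption-laden illustration; the function name and polling interval are invented, and it relies on pgrep being available. If a watchdog is restarting named automatically, keep the polling interval shorter than the watchdog's, or the death will be missed.

```shell
#!/bin/sh
# Sketch only: the function name and default interval are placeholders.

# stop_capture_when_gone NAME CAPTURE_PID [INTERVAL]
# Poll every INTERVAL seconds (default 10) until no process called NAME
# exists, then print the current time and kill the capture process.
stop_capture_when_gone() {
    while pgrep -x "$1" >/dev/null 2>&1; do
        sleep "${3:-10}"
    done
    date
    kill "$2"
}

# Example (the interface is left as a placeholder, as in the post above):
# snoop -d <interface> -V -x 0,65535 -r port 53 and udp > /bigpartition/snoop &
# stop_capture_when_gone named $!
```

The printed time marks roughly where in the snoop output to start reading backwards.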
Hope this helps a little bit.
Alternatively, you could find the core dump file, move it to a machine with gcc on it, and use the debug tool (gdb) to look at the ASM that it was trying to run when it crashed (IMHO a little harder to do than the truss, but still a viable option).
EDIT: I want to make sure you know about truss; it is incredibly powerful and is very helpful in letting you know what caused problems with processes. With it, you can very often see whether a process was, for example, looking for a library it couldn't find before it core dumped, or whether it tried to allocate memory when none was available. Very, very helpful. Here is a brief snippet from man truss:
truss - trace system calls and signals
The truss utility executes the specified command and produces a trace of the system calls it performs, the signals it receives, and the machine faults it incurs. Each line of the trace output reports either the fault or signal name or the system call name with its arguments and return value(s). System call arguments are displayed symbolically when possible using defines from relevant system headers; for any path name pointer argument, the pointed-to string is displayed. Error returns are reported using the error code names described in intro(3).
Optionally (see the -u option), truss will also produce an entry/exit trace of user-level function calls executed by the traced process, indented to indicate nesting.
December 23rd, 2002, 07:25 PM
yes I have several layers of different IDS and monitoring "devices" running. I have checked the traffic and I see nothing odd about it, but then again I could be missing something.
I may drop a sniffer on those particular boxes to see what's going on.