eth0 up, down, up again...

Okay, I am officially stumped.

A few months ago, we began leasing a PIII server from Tulip Systems as a mailserver. Almost literally since the first week, we've had inexplicable problems with the machine randomly crashing... or so we thought. The machine would become unresponsive, they'd reboot it, and everything would be fine again for anywhere from twelve hours to twelve days before it happened again. The only other weird activity we noticed was a higher than expected number of collisions on eth0 - around ten percent of packets handled. Every time it seemed to crash, we were stumped, as there wasn't a single log entry anywhere that pointed to a cause. This is CentOS 4, running (at present) a 2.6.9-22 kernel.
The server is running pretty much just Apache and Qmail. Qmail is compiled from source (of course), while everything else is from RPM, kept updated. We really don't think it's a software problem. After much head-scratching, we convinced Tulip to swap the HDD into a completely new chassis - different mobo, different RAM, different everything - hoping that'd solve the problem. We were really hopeful for a week or so, but then it went down again last night. We groused a bit, submitted a reboot request, and were amazed when the server seemed to come back up about sixty seconds later.
We logged in to make sure everything was running, and much to our surprise, the server hadn't been rebooted yet - it showed as being up for several days. A quick check of /var/log/messages showed the following:

Nov 23 00:05:56 sparky kernel: e100: eth0: e100_watchdog: link up, 10Mbps, half-duplex

Which coincided with the system becoming available again.
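In case anyone wants to check their own logs for the same pattern, here is a rough Python sketch that pulls the e100 watchdog link messages out of a syslog-style file. It assumes every entry has the same shape as the one line quoted above; I'm also assuming that a "link down" entry looks the same minus the speed and duplex part, which I haven't confirmed from our logs. The names link_events and scan_log are just made up for this example.

```python
import re

# Pattern modelled on the single log line quoted above.
# Assumption: "link down" entries share the prefix but carry no speed/duplex.
LINK_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+"
    r"e100:\s+(?P<iface>\w+):\s+e100_watchdog:\s+link\s+(?P<state>up|down)"
    r"(?:,\s*(?P<speed>\d+)Mbps,\s*(?P<duplex>half|full)-duplex)?"
)


def link_events(lines):
    """Return one dict per e100 watchdog link message found in `lines`."""
    events = []
    for line in lines:
        match = LINK_RE.match(line)
        if match:
            events.append(match.groupdict())
    return events


def scan_log(path="/var/log/messages"):
    """Read a syslog-style file and return its link events (hypothetical helper)."""
    with open(path) as log:
        return link_events(log)
```

Running scan_log() as root and printing each event's stamp, state, speed and duplex should give a timeline of when the interface changed state, which can then be lined up against the times the box seemed to "crash".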

Tulip's response when I asked about this was "We monitor our network constantly, and the network didn't go down." Their network may not have gone down, but it sure looks to me like eth0 on our server did.

Anyone have ANY ideas how to solve this? This is so incredibly frustrating - I feel like we've finally pinpointed the cause of all the problems - but I haven't a damned clue what to do about it. Help? Please?