Help desperately needed! Server crashes every day, httpd consumes all memory
I've been experiencing server lockups almost every day. When the server comes back, I have to restart the DNS/name server (BIND) to get my sites back online. I have checked the logs and there is no indication of what's making it crash. I've stopped spamd, Analog and ClamAV, which did not help, and I've also checked user cron jobs, etc.
The server runs RHEL 3, kernel 2.4.21-4.0.1.EL (I did the iowait tweak).
I have APF, PRM and SIM installed, and they have not helped either. Load goes over 28 and the server crashes. I noticed that all my memory gets consumed (physical and swap); usually I see 2 httpd processes owned by nobody that take all the memory before it crashes.
I have to keep monitoring the server load and restart exim, httpd and mysqld when the load goes over 10. That brings the load down, and the server works OK for 6, 8 or more hours before the load goes high again.
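For reference, the manual check I'm doing could be scripted roughly like this. It is only a sketch, not something I have running: it assumes a Python 3.7+ interpreter is available (the stock Python on this box is 2.2, so that is an assumption), the threshold of 10 is just the point where I currently step in by hand, and the function names are made up for illustration. It only reports the biggest memory users and does not restart anything.

```python
import subprocess

# Assumed threshold: the load at which I currently restart services by hand.
LOAD_LIMIT = 10.0


def parse_loadavg(text):
    """Return the 1-minute load average from the contents of /proc/loadavg."""
    return float(text.split()[0])


def top_memory_processes(ps_output, count=5):
    """Return the `count` largest processes by RSS (KB).

    Expects the output of `ps -eo pid,rss,user,comm`, header line included.
    Each result is a tuple (rss_kb, pid, user, command).
    """
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 3)
        if len(parts) == 4 and parts[0].isdigit() and parts[1].isdigit():
            pid, rss, user, comm = parts
            rows.append((int(rss), int(pid), user, comm))
    rows.sort(reverse=True)  # largest RSS first
    return rows[:count]


def check_once():
    """Print the largest processes if the 1-minute load is at or above the limit."""
    with open("/proc/loadavg") as fh:
        load = parse_loadavg(fh.read())
    if load < LOAD_LIMIT:
        return
    ps_output = subprocess.run(
        ["ps", "-eo", "pid,rss,user,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    print("load %.2f is at or above %.2f, largest processes by RSS:" % (load, LOAD_LIMIT))
    for rss_kb, pid, user, comm in top_memory_processes(ps_output):
        print("  pid=%d user=%s rss=%dk cmd=%s" % (pid, user, rss_kb, comm))


if __name__ == "__main__":
    check_once()
```

Run from cron every minute, something like this would at least leave a record of which processes were growing before the lockup.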
Here is the last top output from when it crashed.
***********************************************
06:38:25 up 9:11, 1 user, load average: 28.36, 10.87, 4.44
99 processes: 76 sleeping, 22 running, 0 zombie, 1 stopped
CPU states: cpu user nice system irq softirq iowait idle
total 0.0% 0.0% 94.2% 0.8% 0.1% 2.9% 1.8%
Mem: 1022480k av, 1014268k used, 8212k free, 0k shrd, 7332k buff
967604k active, 13620k inactive
Swap: 2097136k av, 2097136k used, 0k free 15992k cached
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
15327 nobody 16 0 1458M 490M 504 D 1.9 49.1 0:24 0 /usr/local/apache/bin/httpd
15325 nobody 16 0 1391M 433M 460 D 2.2 43.4 0:23 0 /usr/local/apache/bin/httpd
16816 nobody 15 0 1784 1388 988 S 0.0 0.1 0:00 0 proftpd: connected: 12.96.160.116 (12.96.160.116:14473)
16804 nobody 15 0 1752 1268 956 S 0.0 0.1 0:00 0 proftpd: connected: 12.96.160.116 (12.96.160.116:13071)
10180 named 25 0 5100 1232 712 S 1.0 0.1 0:08 0 /usr/sbin/named -u named
15350 root 35 19 8872 1204 416 D N 0.9 0.1 0:12 0 pkgacct - pereiran
16814 nobody 16 0 14840 1076 816 R 2.5 0.1 0:00 0 /usr/local/apache/bin/httpd
15328 nobody 15 0 17092 1036 572 D 1.0 0.1 0:00 0 /usr/local/apache/bin/httpd
14631 nobody 16 0 18044 1000 692 R 1.0 0.0 0:00 0 /usr/local/apache/bin/httpd
15313 nobody 16 0 15832 896 552 D 2.0 0.0 0:00 0 /usr/local/apache/bin/httpd
16811 nobody 16 0 14700 884 680 R 2.9 0.0 0:00 0 /usr/local/apache/bin/httpd
15214 nobody 16 0 17428 880 376 R 1.1 0.0 0:00 0 /usr/local/apache/bin/httpd
16809 nobody 16 0 14688 844 668 R 2.4 0.0 0:00 0 /usr/local/apache/bin/httpd
16813 nobody 16 0 14672 844 652 R 2.7 0.0 0:00 0 /usr/local/apache/bin/httpd
5649 mailman 15 0 3812 836 416 S 0.3 0.0 0:00 0 /usr/bin/python2.2 /usr/local/cpanel/3rdparty/mailman/bin/qrunner --runner=BounceRunner:0:1
15980 root 16 0 836 836 452 R 0.4 0.0 0:02 0 top
15255 nobody 16 0 14844 836 660 R 1.5 0.0 0:00 0 /usr/local/apache/bin/httpd
16808 nobody 16 0 14672 828 652 R 3.0 0.0 0:00 0 /usr/local/apache/bin/httpd
16810 nobody 16 0 14620 788 600 R 2.3 0.0 0:00 0 /usr/local/apache/bin/httpd
16826 mailnull 19 0 1008 788 648 R 2.9 0.0 0:00 0 /usr/sbin/exim -bd -q60m
14712 nobody 16 0 15884 784 368 R 2.0 0.0 0:00 0 /usr/local/apache/bin/httpd
16807 root 16 0 3800 772 584 R 1.7 0.0 0:00 0 webmaild - serving 64.76 restart
16812 nobody 16 0 14560 724 540 R 3.6 0.0 0:00 0 /usr/local/apache/bin/httpd
16817 nobody 15 0 14512 704 492 R 0.6 0.0 0:00 0 /usr/local/apache/bin/httpd
16819 nobody 15 0 14512 704 492 R 0.9 0.0 0:00 0 /usr/local/apache/bin/httpd
16818 nobody 16 0 14512 700 492 R 0.8 0.0 0:00 0 /usr/local/apache/bin/httpd
16823 nobody 16 0 14548 692 528 R 3.1 0.0 0:00 0 /usr/local/apache/bin/httpd
16822 nobody 15 0 14516 688 496 S 0.0 0.0 0:00 0 /usr/local/apache/bin/httpd
5657 mailman 15 0 3136 684 264 S 0.0 0.0 0:00 0 /usr/bin/python2.2 /usr/local/cpanel/3rdparty/mailman/bin/qrunner --runner=OutgoingRunner:0
16820 nobody 15 0 14508 680 488 S 0.0 0.0 0:00 0 /usr/local/apache/bin/httpd
16821 nobody 15 0 14508 680 488 S 0.0 0.0 0:00 0 /usr/local/apache/bin/httpd
15258 nobody 16 0 15760 656 372 D 2.0 0.0 0:00 0 /usr/local/apache/bin/httpd
15312 nobody 15 0 15116 656 500 D 0.6 0.0 0:00 0 /usr/local/apache/bin/httpd
5651 mailman 15 0 3040 648 248 D 0.2 0.0 0:00
**************************************
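To put numbers on the listing above: the two big httpd processes (PIDs 15327 and 15325) account for almost all of the physical memory. A quick sketch of the arithmetic, assuming that top's RSS column is in kilobytes unless it carries an M suffix (the helper name `to_kb` is just for illustration):

```python
MEM_TOTAL_KB = 1022480  # "Mem: 1022480k av" from the top header above


def to_kb(field):
    """Convert a top SIZE/RSS field such as '490M' or '1388' to kilobytes."""
    if field.endswith("M"):
        return int(float(field[:-1]) * 1024)
    return int(field)


def percent_of_memory(rss_fields, total_kb=MEM_TOTAL_KB):
    """Return the share of physical memory used by the given RSS values, in percent."""
    used_kb = sum(to_kb(f) for f in rss_fields)
    return 100.0 * used_kb / total_kb


if __name__ == "__main__":
    # RSS of the two large httpd processes in the listing above.
    print("%.1f%% of physical memory" % percent_of_memory(["490M", "433M"]))
```

The individual results line up with the %MEM column top reports for those two processes (49.1 and 43.4), so together they hold roughly 92% of RAM, on top of whatever they have pushed into swap.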
And this was yesterday:
**************************************
19:00:10 up 3 days, 23:16, 1 user, load average: 28.62, 9.49, 3.62
110 processes: 108 sleeping, 2 running, 0 zombie, 0 stopped
CPU states: cpu user nice system irq softirq iowait idle
total 0.4% 0.0% 5.7% 0.8% 0.0% 92.9% 0.0%
Mem: 1022480k av, 1014136k used, 8344k free, 0k shrd, 5308k buff
980736k active, 2460k inactive
Swap: 2097136k av, 2097132k used, 4k free 13308k cached
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
6546 nobody 15 0 1545M 478M 644 D 0.8 47.9 0:24 0 /usr/local/apache/bin/httpd -DSSL
6553 nobody 15 0 15112 1028 796 D 0.8 0.1 0:00 0 /usr/local/apache/bin/httpd -DSSL
6 root 15 0 0 0 0 SW 0.4 0.0 0:30 0 kscand
32204 named 25 0 7136 1120 848 S 0.4 0.1 0:53 0 /usr/sbin/named -u named
6539 nobody 15 0 14928 1020 784 D 0.4 0.0 0:00 0 /usr/local/apache/bin/httpd -DSSL
6543 nobody 15 0 14928 1024 784 D 0.4 0.1 0:00 0 /usr/local/apache/bin/httpd -DSSL
6547 nobody 15 0 1335M 452M 644 D 0.4 45.2 0:21 0 /usr/local/apache/bin/httpd -DSSL
6589 nobody 15 0 14988 1244 860 D 0.4 0.1 0:00 0 /usr/local/apache/bin/httpd -DSSL
6593 nobody 15 0 14896 936 768 D 0.4 0.0 0:00 0 /usr/local/apache/bin/httpd -DSSL
6596 nobody 15 0 14896 936 768 D 0.4 0.0 0:00 0 /usr/local/apache/bin/httpd -DSSL
6601 root 20 0 456 440 376 D 0.4 0.0 0:00 0 CROND
1 root 15 0 112 80 60 S 0.0 0.0 0:08 0 init
2 root 15 0 0 0 0 SW 0.0 0.0 0:02 0 keventd
3 root 15 0 0 0 0 SW 0.0 0.0 0:00 0 kapmd
4 root 34 19 0 0 0 SWN 0.0 0.0 0:00 0 ksoftirqd/0
7 root 15 0 0 0 0 SW 0.0 0.0 0:01 0 bdflush
5 root 15 0 0 0 0 DW 0.0 0.0 1:09 0 kswapd
8 root 15 0 0 0 0 SW 0.0 0.0 0:00 0 kupdated
9 root 25 0 0 0 0 SW 0.0 0.0 0:00 0 mdrecoveryd
13 root 15 0 0 0 0 SW 0.0 0.0 2:17 0 kjournald
69 root 25 0 0 0 0 SW 0.0 0.0 0:00 0 khubd
2940 root 15 0 0 0 0 SW 0.0 0.0 0:00 0 kjournald
2941 root 15 0 0 0 0 SW 0.0 0.0 0:44 0 kjournald
3628 root 15 0 384 352 304 S 0.0 0.0 0:26 0 syslogd -m 0
3632 root 15 0 188 136 132 S 0.0 0.0 0:00 0 klogd -x
4876 root 15 0 1776 732 488 D 0.0 0.0 0:02 0 chkservd
4969 root 15 0 412 384 328 D 0.0 0.0 0:00 0 crond
5275 root 34 19 11020 612 444 D N 0.0 0.0 13:22 0 cpanellogd - sleeping for logs
5330 cpanel 15 0 1052 268 264 S 0.0 0.0 0:01 0 /usr/bin/stunnel-4.04local /usr/local/cpanel/etc/stunnel/default/stunnel.conf
5357 mailman 15 0 3044 148 144 S 0.0 0.0 0:00 0 /usr/bin/python2.2 /usr/local/cpanel/3rdparty/mailman/bin/mailmanctl -s start
5365 root 15 0 140 88 84 S 0.0 0.0 0:00 0 rhnsd --interval 240
5369 mailman 15 0 4060 864 400 S 0.0 0.0 0:01 0 /usr/bin/python2.2 /usr/local/cpanel/3rdparty/mailman/bin/qrunner --runner=ArchRunner:0:1 -
5370 mailman 15 0 6740 980 496 S 0.0 0.0 0:04 0 /usr/bin/python2.2 /usr/local/cpanel/3rdparty/mailman/bin/qrunner --runner=BounceRunner:0:1
5371 mailman 15 0 3040 604 248 S 0.0 0.0 0:00 0 /usr/bin/python2.2 /usr/local/cpanel/3rdparty/mailman/bin/qrunner --runner=CommandRunner:0:
5372 mailman 15 0 5508 908 492 D 0.0 0.0 0:04 0 /usr/bin/python2.2 /usr/local/cpanel/3rdparty/mailman/bin/qrunner --runner=IncomingRunner:0
5373 mailman 15 0 3076 648 248 S 0.0 0.0 0:00 0 /usr/bin/python2.2 /usr/local/cpanel/3rdparty/mailman/bin/qrunner --runner=NewsRunner:0:1 -
5374 mailman 15 0 4644 1016 548 S 0.0 0.0 0:19 0 /usr/bin/python2.2 /usr/local/cpanel/3rdparty/mailman/bin/qrunner --runner=OutgoingRunner:0
5375 mailman 15 0 4424 856 452 D 0.0 0.0 0:05 0 /usr/bin/python2.2 /usr/local/cpanel/3rdparty/mailman/bin/qrunner --runner=VirginRunner:0:1
5376 mailman 15 0 3040 196 192 S 0.0 0.0 0:00 0 /usr/bin/python2.2 /usr/local/cpanel/3rdparty/mailman/bin/qrunner --runner=RetryRunner:0:1
************************************
This is driving me crazy. Can anyone please help me with some advice?