Odd sendmail problem (Sco Unix forums, Unix Operating Systems category)
(SCO:Unix::5.0.4Eb rs.Unix504.0.1.a oss601a.Unix504 SCO:SendMail::8.8.5c)

Howdy.

An old system which had been working fine until recently (no hardware changes, and only unrelated software changes: non-OS, just a custom app) has had its sendmail daemon start to act funny.

The only telltale signs are that the load average skyrockets, /usr/spool/mqueue/ (/var/opt/K/SCO/SendMail/8.8.5c/spool/mqueue/) becomes inaccessible to 'ls' etc., and thus mail flow stops.

A reboot of the machine returns it to normal usage for some time, but it happens again 2-3 days later.

I thought that maybe some garbage message was getting into the queue, but I have no way of verifying this.

Two things I'd like to know:

Is there any way to find out what is under that directory structure without relying upon the 'ls' and 'find' tools? Find shows the first 2 entries, but then stops.

Beyond upgrading everything (not an option at this stage), any other ideas about how to combat this?

The system is in a protected network, so it is highly doubtful that someone is attacking it from the outside (it is not in any name servers, and traffic has to go through a firewall to get there). Firewall logs don't show anything out of the ordinary.

For the moment, I've tried ruling out an issue with the symlink, and just made a bare directory, hopefully leaving the old contents intact (after a reboot).

bkx
In article <3f89de9f@dnews.tpgi.com.au>, Stuart J. Browne <stuart@promed.com.au> wrote:

> (SCO:Unix::5.0.4Eb rs.Unix504.0.1.a oss601a.Unix504 SCO:SendMail::8.8.5c)
>
> Howdy.
>
> An old system which has been working fine (up until recently, no hardware changes, and only unrelated software changes (non-OS, just custom app)) has had its sendmail daemon starting to act funny.

Are you 100% sure that the non-OS custom app isn't contributing to the problem? Some applications have been known to send mail, like LP does when a job is canceled. Sending mail to an unknown user on the system can cause a lot of 'undelivered messages'.

> The only telltale signs are that the load average skyrockets, /usr/spool/mqueue/ (/var/opt/K/SCO/SendMail/8.8.5c/spool/mqueue/) becomes inaccessible to 'ls' etc., and thus mail flow stops.
>
> A reboot of the machine returns it to normal usage again for some time, but it does it again 2-3 days later.

Next time don't reboot the machine, but kill sendmail. Then go look at the messages in the queue. What does the output of 'mailq' tell you? Look at the headers and see if you can spot the problem message.

> I thought that maybe some garbage message was getting into the queue, but have no way of verifying this.

If a restart cures it every time and you don't find the problem before you restart, you are going to keep having this until you determine what is causing it.

> Two things I'd like to know:
>
> Is there any way to find out what is under that directory structure without relying upon the 'ls' and 'find' tools. Find shows the first 2 entries, but then stops.

So what does 'ls' give you? If it's too slow you probably have thousands in the queue, and sorting them is going to take time, so just type echo *. Output will be a mass all on one line, but pipe it through wc and you'll get an idea.

> Beyond upgrading everything (not an option at this stage), any other ideas about how to combat this?
Why do you think upgrading will cure this if you haven't found the problem? I'm not being facetious. You say the only change was a non-OS application. If everything was fine until then, it surely points a finger at that application. Of course, if many people have root access, anything could be wrong. And an upgrade instead of a complete reinstall could carry the problem forward.

> The system is in a protected network, so it is highly doubtful that someone is attacking this from the outside (it is not in any name servers, and has to go through a firewall to get there).

I also doubt that is a problem.

> Firewall logs don't show anything out of normal.

You are looking at the wrong logs. Look at the sendmail logs. Your clue should be there.

> For the moment, I've tried ruling out an issue with the symlink, and just made a bare directory, hopefully leaving the old contents intact (after a reboot).

"hopefully?"

And what 'bare directory' did you make?

Unix systems DO NOT need to be rebooted in my experience unless it's something that has to do with a kernel-related problem. That's based on 20 years with *n*x systems.

I've used sendmail as the MTA for two ISPs and it just doesn't have problems like the ones you have described. I really bet it is NOT sendmail acting funny but something else causing it to accumulate tonnes of undeliverable messages.

Bill
--
Bill Vermillion - bv @ wjv . com
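A minimal sketch of the 'echo *' counting trick suggested above, assuming a POSIX-style shell and a `mktemp` that supports `-d`. The helper name `count_entries` and the scratch directory are made up for illustration; on the real system the directory would be the mail queue. It relies only on the shell's own globbing, so it still works when the 'ls' binary is the thing that hangs, although the shell does sort the expanded names itself.

```shell
#!/bin/sh
# count_entries DIR: count the non-hidden entries in DIR using only shell
# globbing plus wc. (Hypothetical helper name, not a standard tool.)
count_entries() {
    # Subshell, so the caller's working directory is left alone.
    # Caveats: an empty directory reports 1 (the unmatched '*' is echoed
    # literally), and names containing spaces would be over-counted.
    ( cd "$1" && echo * | wc -w )
}

# Demo on a scratch directory standing in for the mail queue.
demo=$(mktemp -d)
touch "$demo/dfAA00001" "$demo/dfAA00002" "$demo/xfAA00002"
count_entries "$demo"     # prints the number of entries found
rm -rf "$demo"
```

If even this hangs, the problem is probably not the number of files, since nothing here opens or stats the individual entries.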
Bill Vermillion typed (on Tue, Oct 14, 2003 at 07:25:01PM +0000):
| In article <3f89de9f@dnews.tpgi.com.au>,
| Stuart J. Browne <stuart@promed.com.au> wrote:
| >(SCO:Unix::5.0.4Eb rs.Unix504.0.1.a oss601a.Unix504 SCO:SendMail::8.8.5c)
|
| >Two things I'd like to know:
|
| > Is there any way to find out what is under that directory structure
| >without relying upon the 'ls' and 'find' tools. Find shows the first 2
| >entries, but then stops.

I suspect you're describing what happens when you run a command like

    find /some/path | xargs grep "some string"

and it turns out that there is a pipe-file under /some/path. It is the 'grep' that is apparently stopping -- but it isn't stopping, it'll eternally read such a FIFO looking for "some string", until you lose patience and kill the command.

If that's what's happening, then run:

    find /some/path ! -type p | xargs grep "some string"

or better, to avoid the error you get when you try to grep a directory:

    find /some/path -type f | xargs grep "some string"

--
JP
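The FIFO effect described above can be reproduced safely in a scratch directory. This is only a sketch, assuming `mkfifo` and `mktemp -d` are available; the function name `search_regular_files` and the file names are invented for the demonstration.

```shell
#!/bin/sh
# search_regular_files DIR PATTERN: list the regular files under DIR that
# contain PATTERN. (Hypothetical helper name; find, xargs and grep are the
# standard tools.)
search_regular_files() {
    # -type f leaves out FIFOs, directories and device nodes, so grep is
    # never handed a pipe that has no writer and would block forever.
    find "$1" -type f -print | xargs grep -l "$2"
}

demo=$(mktemp -d)
echo "some string" > "$demo/message"
mkfifo "$demo/stuck-pipe"    # a plain 'find | xargs grep' would hang on this
search_regular_files "$demo" "some string"
rm -rf "$demo"
```

Dropping `-type f` from the `find` line in the same scratch directory shows the apparent stall: grep opens `stuck-pipe` and waits for a writer that never arrives.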
"Jean-Pierre Radley" <jpr@jpr.com> wrote in message news:20031014195619.GE1531@jpradley.jpr.com...
> Bill Vermillion typed (on Tue, Oct 14, 2003 at 07:25:01PM +0000):
> | In article <3f89de9f@dnews.tpgi.com.au>,
> | Stuart J. Browne <stuart@promed.com.au> wrote:
> | >(SCO:Unix::5.0.4Eb rs.Unix504.0.1.a oss601a.Unix504 SCO:SendMail::8.8.5c)
> |
> | >Two things I'd like to know:
> |
> | > Is there any way to find out what is under that directory structure
> | >without relying upon the 'ls' and 'find' tools. Find shows the first 2
> | >entries, but then stops.
>
> I suspect you're describing what happens when you run a command like
>
>     find /some/path | xargs grep "some string"
>
> and it turns out that there is a pipe-file under /some/path. It is
> the 'grep' that is apparently stopping -- but it isn't stopping, it'll
> eternally read such a FIFO looking for "some string", until you lose
> patience and kill the command.
>
> If that's what's happening, then run:
>
>     find /some/path ! -type p | xargs grep "some string"
>
> or better, to avoid the error you get when you try to grep a directory:
>
>     find /some/path -type f | xargs grep "some string"

Actually, all I did was:

    cd /usr/spool/mqueue
    find . -print

It showed two entries, then halted.
> Stuart J. Browne <stuart@promed.com.au> wrote:
>> (SCO:Unix::5.0.4Eb rs.Unix504.0.1.a oss601a.Unix504 SCO:SendMail::8.8.5c)
>>
>> Howdy.
>>
>> An old system which has been working fine (up until recently, no hardware changes, and only unrelated software changes (non-OS, just custom app)) has had its sendmail daemon starting to act funny.
>
> Are you 100% sure that the non-OS custom app isn't contributing to the problem? Some applications have been known to send mail, like LP does when a job is canceled. Sending mail to an unknown user on the system can cause a lot of 'undelivered messages'.

Our application does launch "/usr/lib/sendmail -t", but I can vouch for the data that goes to it. At worst, it should be forwarded straight through the queue (DS entry in /usr/lib/sendmail.cf) to a real mail server.

>> The only telltale signs are that the load average skyrockets, /usr/spool/mqueue/ (/var/opt/K/SCO/SendMail/8.8.5c/spool/mqueue/) becomes inaccessible to 'ls' etc., and thus mail flow stops.
>>
>> A reboot of the machine returns it to normal usage again for some time, but it does it again 2-3 days later.
>
> Next time don't reboot the machine, but kill sendmail. Then go look at the messages in the queue. What does the output of 'mailq' tell you? Look at the headers and see if you can spot the problem message.

This is what I've attempted each time it's happened. The sendmail processes don't die. Mailq doesn't complete. I thought maybe it was lots-of-files, but after leaving it for over half an hour with still no response, no. I've seen lots-of-files before. Besides, using an unsorted 'find' would get past that, which it did not.

>> I thought that maybe some garbage message was getting into the queue, but have no way of verifying this.
>
> If a restart cures it every time and you don't find the problem before you restart, you are going to keep having this until you determine what is causing it.

>> Two things I'd like to know:
>>
>> Is there any way to find out what is under that directory structure without relying upon the 'ls' and 'find' tools. Find shows the first 2 entries, but then stops.
>
> So what does 'ls' give you? If it's too slow you probably have thousands in the queue, and sorting them is going to take time, so just type echo *. Output will be a mass all on one line, but pipe it through wc and you'll get an idea.

Forgot about 'echo *', will try that next time.

>> Beyond upgrading everything (not an option at this stage), any other ideas about how to combat this?
>
> Why do you think upgrading will cure this if you haven't found the problem? I'm not being facetious. You say the only change was a non-OS application. If everything was fine until then, it surely points a finger at that application. Of course, if many people have root access, anything could be wrong. And an upgrade instead of a complete reinstall could carry the problem forward.

I'd agree with you if we hadn't updated that application on over 100 other machines, some of which are identical. I'm more inclined to believe that it has to do with external changes on the LAN (which I know is a pipe dream).

The only thing the custom app does with regard to sendmail is issue '/usr/lib/sendmail -t < tmp.file'. The code which does that hasn't changed within the application. At no other time does it go anywhere near sendmail or the mail subsystems.

>> The system is in a protected network, so it is highly doubtful that someone is attacking this from the outside (it is not in any name servers, and has to go through a firewall to get there).
>
> I also doubt that is a problem.
>
>> Firewall logs don't show anything out of normal.
>
> You are looking at the wrong logs. Look at the sendmail logs. Your clue should be there.

Unfortunately not. They are clean. They show normal operation, and then just no operation.

>> For the moment, I've tried ruling out an issue with the symlink, and just made a bare directory, hopefully leaving the old contents intact (after a reboot).
>
> "hopefully?"

There are no qf* files in there, just an assortment of df* and xf*, so no header details. Nothing is obscenely large, just a mix and match of normal mail traffic, about 30 partial messages in all.

'Hopefully' meaning that the nasty killing of processes during the reboot leaves the partial files intact, and that they are not cleaned off by an fsck during boot (I don't have console access, and walking the users out there through a single-user boot-up, bypassing the fscks, isn't really on the cards).

> And what 'bare directory' did you make?

Moved symlink '/usr/spool/mqueue' aside, created '/usr/spool/mqueue' as a directory.

> Unix systems DO NOT need to be rebooted in my experience unless it's something that has to do with a kernel-related problem. That's based on 20 years with *n*x systems.

I'm of a similar mind (much to my boss's frustration), but when the load average keeps creeping up and processes don't die, choices are very limited, especially on production machines where people are trying to work.

I'll take 10 minutes of abuse for dropping everybody off, rather than 8 hours of "It's slow! It's slow!" in this sort of situation.

> I've used sendmail as the MTA for two ISPs and it just doesn't have problems like the ones you have described. I really bet it is NOT sendmail acting funny but something else causing it to accumulate tonnes of undeliverable messages.

I use it for 1 ISP, as well as all in-house, plus all of our clients (more than 100). I trust sendmail. I don't have that many running this particular version on OSR504, however, so I was wondering if there was a specific issue that I was unaware of.

> Bill
> --
> Bill Vermillion - bv @ wjv . com

Thanks Bill

bkx
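The "move the symlink aside, create a bare directory" step mentioned above could look roughly like the following. This is a sketch run against a scratch directory rather than the real /usr/spool/mqueue; the function name `swap_symlink_for_dir` and the `.old` suffix are assumptions, and it deliberately leaves out stopping sendmail and setting ownership and permissions on the new directory, which the real step would also need.

```shell
#!/bin/sh
# swap_symlink_for_dir PATH: rename the symlink at PATH to PATH.old and
# create an empty directory in its place. (Hypothetical helper name.)
swap_symlink_for_dir() {
    link=$1
    # Refuse to touch anything that is not actually a symlink.
    [ -L "$link" ] || { echo "$link is not a symlink" >&2; return 1; }
    # mv renames the link itself; the directory it points at is untouched.
    mv "$link" "$link.old" && mkdir "$link"
}

# Demo: 'real' stands in for the versioned spool directory.
demo=$(mktemp -d)
mkdir "$demo/real"
touch "$demo/real/dfAA00001"
ln -s "$demo/real" "$demo/mqueue"
swap_symlink_for_dir "$demo/mqueue"
ls -l "$demo"
rm -rf "$demo"
```

Because only the link is renamed, the old queue contents remain reachable through both the original directory and the `.old` link for later inspection.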
In article <3f8c880b$1@dnews.tpgi.com.au>, Stuart J. Browne <stuart@promed.com.au> wrote:

>> Stuart J. Browne <stuart@promed.com.au> wrote:
>>> (SCO:Unix::5.0.4Eb rs.Unix504.0.1.a oss601a.Unix504 SCO:SendMail::8.8.5c)
>>>
>>> Howdy.
>>>
>>> An old system which has been working fine (up until recently, no hardware changes, and only unrelated software changes (non-OS, just custom app)) has had its sendmail daemon starting to act funny.
>>
>> Are you 100% sure that the non-OS custom app isn't contributing to the problem? Some applications have been known to send mail, like LP does when a job is canceled. Sending mail to an unknown user on the system can cause a lot of 'undelivered messages'.
>
> Our application does launch "/usr/lib/sendmail -t", but I can vouch for the data that goes to it. At worst, it should be forwarded straight through the queue (DS entry in /usr/lib/sendmail.cf) to a real mail server.

Is the DS defined to be another machine? If so, what is the possibility that that machine is rejecting the mail and it is going back to a nonexistent user?

I had a situation [BSD web server] where the ownership of the web server was changed to www, so a break-in wouldn't give root access. When the account was created it was with a directory of 'nonexistant' and a shell of '/bin/nologin'. One day I got a 'file system full' message and found that 'www' had many MBs of messages, and with no shell or any other access it went unnoticed until things got full.

Another time [a long time ago] I screwed up on something in sendmail, back in the 4.? days when it wasn't as robust, and when I started it up one process was generating error messages and filling up the space at a rapid pace [Irix on an SGI]. I tried killing a process I'd seen in the ps output, but I would get 'no such process'. sendmail had run amuck and was spawning processes and finishing them faster than I could find them.
I issued 'killall /usr/sbin/sendmail' and it took about 20 minutes for the system to become stable, as it was killing processes only slightly faster than sendmail was generating them. This was in a headless server environment where you don't really want to have to reboot unless absolutely necessary.

>>> The only telltale signs are that the load average skyrockets, /usr/spool/mqueue/ (/var/opt/K/SCO/SendMail/8.8.5c/spool/mqueue/) becomes inaccessible to 'ls' etc., and thus mail flow stops.

You can bet the load average went through the roof in the above scenario also.

>>> A reboot of the machine returns it to normal usage again for some time, but it does it again 2-3 days later.
>>
>> Next time don't reboot the machine, but kill sendmail. Then go look at the messages in the queue. What does the output of 'mailq' tell you? Look at the headers and see if you can spot the problem message.
>
> This is what I've attempted each time it's happened. The sendmail processes don't die. Mailq doesn't complete. I thought maybe it was lots-of-files, but after leaving it for over half an hour and still no response, no. I've seen lots-of-files before. Besides, using an unsorted 'find' would get past that, which it did not.

Are you trying to kill sendmail by process ID? As noted above, I found that was fruitless.

Since all the sendmail queue file names end in digits, often when I look at the queue and want to find out what is causing what without running mailq, I'll just list the directory and note the last digits of the last file. Then I can 'less' *NN and get the q and d files.

You might try this after using echo *, and then perhaps try echo *NN until you see a number or two for which you can try cat *NN [if nothing else works] to at least get a clue about what has gone astray.

>>> I thought that maybe some garbage message was getting into the queue, but have no way of verifying this.
>>
>> If a restart cures it every time and you don't find the problem before you restart, you are going to keep having this until you determine what is causing it.

>>> Two things I'd like to know:
>>>
>>> Is there any way to find out what is under that directory structure without relying upon the 'ls' and 'find' tools. Find shows the first 2 entries, but then stops.
>>
>> So what does 'ls' give you? If it's too slow you probably have thousands in the queue, and sorting them is going to take time, so just type echo *. Output will be a mass all on one line, but pipe it through wc and you'll get an idea.
>
> Forgot about 'echo *', will try that next time.

It's easy to forget, and I cuss myself when I need it and take a while to remember it. I used to rely on it when trying to fix old Xenix systems where no one had a clue: I'd boot with one of my disks and use echo * to find out what was there.

>>> Beyond upgrading everything (not an option at this stage), any other ideas about how to combat this?
>>
>> Why do you think upgrading will cure this if you haven't found the problem? I'm not being facetious. You say the only change was a non-OS application. If everything was fine until then, it surely points a finger at that application. Of course, if many people have root access, anything could be wrong. And an upgrade instead of a complete reinstall could carry the problem forward.
>
> I'd agree with you if we hadn't updated that application on over 100 other machines, some of which are identical. I'm more inclined to believe that it has to do with external changes on the LAN (which I know is a pipe dream).

It could be related to that, particularly if the DS variable in your sendmail.cf is not empty and is pointing somewhere else. As to other machines which are identical, I think one of Murphy's laws of computing is 'Identical systems never are'.

> The only thing the custom app does with regard to sendmail is issue '/usr/lib/sendmail -t < tmp.file'.
> The code which does that hasn't changed within the application. At no other time does it go anywhere near sendmail or the mail subsystems.

[When I referenced /usr/sbin/sendmail above I had forgotten that it was under /usr/lib on SCO.]

As for the tmp.file you read in with the -t, I have no idea what is in it, but try it with ONLY root at the localhost and see if it goes where it should, then try adding the other names one at a time.

You might also want to temporarily make the DS blank, if it is not at this moment, and then restart sendmail. The .cf, the sendmail.cw [or local-host-names, depending on version], and the relay file are the only ones I've found that need a restart to be recognized. I've not seen any documentation on this [it is too large], so that is just observation.

>>> The system is in a protected network, so it is highly doubtful that someone is attacking this from the outside (it is not in any name servers, and has to go through a firewall to get there).
>>
>> I also doubt that is a problem.
>>
>>> Firewall logs don't show anything out of normal.
>>
>> You are looking at the wrong logs. Look at the sendmail logs. Your clue should be there.
>
> Unfortunately not. They are clean. Shows normal operation, and then just no operation.

Do you know when the app is supposed to be sending mail? If so, can you trigger it while performing a tail -f on the logfile? [That presupposes your system is not so busy that the lines scroll off the screen faster than you can read them, like on one of my servers.]

>>> For the moment, I've tried ruling out an issue with the symlink, and just made a bare directory, hopefully leaving the old contents intact (after a reboot).
>>
>> "hopefully?"
>
> There are no qf* files in there, just an assortment of df* and xf*, so no header details. Nothing is obscenely large, just a mix and match of normal mail traffic, about 30 partial messages in all.
The only time I've seen just a df* and nothing else, it appeared to be from some spammer somewhere, as the files were huge and mostly binary. Could something like swen be on a local PC and deluging you with something like that? That's just a wild thought as I type this.

> 'hopefully' meaning that the nasty-killing of processes due to reboot leaves the partial files intact, and not cleaned off by a fsck during boot (I don't have console access, and walking the users out there through single-user boot-up, bypassing fsck's isn't really on the cards).

fsck should not touch a thing if you did a reboot and not a power-off. If your system is swamped it may take quite a while for the shutdown to complete. If it never does and you have to take drastic measures, you might try sync;sync;sync;reboot all on one line. If that works it might get most things flushed so an fsck won't occur, but don't bet on it.

>> And what 'bare directory' did you make?
>
> Moved symlink '/usr/spool/mqueue' aside, created '/usr/spool/mqueue' as a directory.

Understood.

>> Unix systems DO NOT need to be rebooted in my experience unless it's something that has to do with a kernel-related problem. That's based on 20 years with *n*x systems.
>
> I'm of a similar mind (much to my boss's frustration), but when the load average keeps creeping, and processes don't die, choices are very limited, especially on production machines where people are trying to work.

Do you have a feel for how long it is before it starts getting sluggish? If so, wait about half that time, then just stop sendmail and take a look to see if things are building up. Then fire it back up and see whether it still goes down at the usual time, or whether the clock has been reset and it now takes the full interval from the restart before it gets slow. [I hope I explained that properly.]

If that is the case (the time to slowness is counted from when sendmail was started), you might make a cron entry to restart sendmail nightly.
If sendmail takes a while to shut down you might want to leave it off for 10 or more minutes. If you have a secondary MX machine that should be no problem, as any mail destined for that machine will be held there. [I'm a firm believer in secondary MX, but I'm constantly amazed at how few people have them.]

> I'll take 10 min abuse for dropping everybody off, rather than 8hrs of "It's slow! it's slow!" in this sort of situation.

That's just good business sense.

>> I've used sendmail as the MTA for two ISPs and it just doesn't have problems like the ones you have described. I really bet it is NOT sendmail acting funny but something else causing it to accumulate tonnes of undeliverable messages.
>
> I use it for 1 ISP, as well as all in-house, plus all of our clients (more than 100). I trust sendmail. I don't have that many running this particular version on OSR504, however, so I was wondering if there was a specific issue that I was unaware of.

I run a small niche-market ISP, but in the past 10 days I've seen my daily maillog grow 25% larger as the spam just keeps increasing. There were 245,808 lines in yesterday's maillog, and that's with a user population of only about 150.

[One web site comes up #1 in Google and it has never been advertised; only in the last two months has any development work been done on it in two years. I just checked, and I tossed away over 49,000 messages destined for that one yesterday. So that's 20 percent of the total. For that I don't send rejects or anything. I just route them to /dev/null. Not polite, but the only effective way to handle those.]

Being on a tier 1 backbone makes life interesting, and sometimes hectic.

I hope some of these ideas may help.

Bill
--
Bill Vermillion - bv @ wjv . com
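The trailing-digits trick from this reply ('echo *NN', then 'less' or 'cat' on the matches) can be sketched as below. It assumes queue files named with a qf/df/xf prefix followed by a shared ID, as mentioned in the thread; the helper name `show_queue_files` and the sample file names are invented for the demonstration.

```shell
#!/bin/sh
# show_queue_files DIR SUFFIX: print the names in DIR that end in SUFFIX,
# e.g. the last two digits of a queue ID. (Hypothetical helper name.)
show_queue_files() {
    # Subshell keeps the caller's working directory unchanged; the glob is
    # expanded by the shell itself, so neither ls nor find is involved.
    ( cd "$1" && echo *"$2" )
}

# Demo on a scratch directory standing in for the mail queue.
demo=$(mktemp -d)
touch "$demo/qfAA00017" "$demo/dfAA00017" "$demo/dfAA00023"
show_queue_files "$demo" 17    # lists the header and body files for ...17
rm -rf "$demo"
```

Once a suffix turns up both a q file and a d file, `cat` on the same glob shows the headers alongside the body, which is usually enough to tell where a message came from.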