help analyzing low system (with sar/vmstat/u386mon/sarcheck data)
system configuration: sco 5.0.6, with about 170 ttys logged in by telnet, two 2G cpu, 4G memory

(1) output of sar -A:

SCO_SV zjyw-38 3.2v5.0.6 i80386 03/31/2008

09:01:04 %usr %sys %wio %idle (-u)
         bread/s lread/s %rcache bwrit/s lwrit/s %wcache pread/s pwrit/s (-b)
         device %busy avque r+w/s blks/s avwait avserv (-d)
         c_hits cmisses (hit %) (-n)
         rawch/s canch/s outch/s rcvin/s xmtin/s mdmin/s (-y)
         scall/s sread/s swrit/s fork/s exec/s rchar/s wchar/s (-c)
         swpin/s bswin/s swpot/s bswot/s pswch/s (-w)
         iget/s namei/s dirbk/s (-a)
         runq-sz %runocc swpq-sz %swpocc (-q)
         proc-sz ov inod-sz ov file-sz ov lock-sz (-v)
         msg/s sema/s (-m)
         vflt/s pflt/s pgfil/s rclm/s (-p)
         freemem freeswp availrmem availsmem (-r)
         cpybuf/s slpcpybuf/s (-B)
         dptch/s idler/s swidle/s (-R)
         ovsiohw/s ovsiodma/s ovclist/s (-g)
         mpbuf/s ompb/s mphbuf/s omphbuf/s pbuf/s spbuf/s dmabuf/s sdmabuf/s (-h)

Average  6 20 2 72 (-u)
Average  6 193153 100 68 1071 94 0 0 (-b)
Average  Sdsk-0 100.00 1.00 22.86 148.17 0.00 57.19 (-d)
Average  453343 8647 (98%) (-n)
Average  25 1 5553 0 0 0 (-y)
Average  241413 175778 7502 3.37 3.48 1171261 72597 (-c)
Average  0.00 0.0 0.00 0.0 951 (-w)
Average  7614 990 1768 (-a)
Average  2.1 100 (-q)
Average  0.00 0.00 (-m)
Average  76.83 158.98 0.05 0.00 (-p)
Average  611232 1048576 799826 513961 (-r)
Average  0.00 0.00 (-B)
Average  2707.10 376.22 45.88 (-R)
Average  0.00 0.00 0.00 (-g)
Average  0.04 0.00 16.64 0.00 0.00 0.00 0.00 0.00 (-h)

(2) output of sar:

# sar -r 1 10
SCO_SV zjyw-38 3.2v5.0.6 i80386 03/31/2008
08:46:41 freemem freeswp availrmem availsmem (-r)
08:46:42 680115 1048576 802064 648186
08:46:43 680048 1048576 802062 648143
08:46:44 679997 1048576 802062 648143

# sar -w 1 10
SCO_SV zjyw-38 3.2v5.0.6 i80386 04/02/2008
11:48:43 swpin/s bswin/s swpot/s bswot/s pswch/s (-w)
11:48:44 0.00 0.0 0.00 0.0 1383
11:48:45 0.00 0.0 0.00 0.0 1321
11:48:46 0.00 0.0 0.00 0.0 1417
11:48:47 0.00 0.0 0.00 0.0 1247

sar -p:
08:29:21 vflt/s pflt/s pgfil/s rclm/s (-p)
08:29:22 541.18 1567.65 0.00 0.00
08:29:23 197.06 109.80 0.00 0.00
08:29:24 44.55 41.58 0.00 0.00
08:29:25 50.98 134.31 0.00 0.00
08:29:26 85.15 388.12 0.00 0.00
08:29:27 111.76 358.82 0.00 0.00
08:29:28 534.31 726.47 0.00 0.00
08:29:29 216.67 131.37 0.00 0.00
08:29:30 290.10 550.50 0.00 0.00
08:29:31 244.12 138.24 0.00 0.00
08:29:32 33.98 113.59 0.00 0.00
08:29:33 103.96 279.21 0.00 0.00

(3) output of vmstat:

PROCS PAGING SYSTEM CPU
r b w frs dmd sw cch fil pft frp pos pif pis rso rsi sy cs us su id
1 743 0 1048576 382 0 1222 0 783 0 0 0 0 0 0 170730 547 11 14 75
1 743 0 1048576 0 0 0 0 65 0 0 0 0 0 0 123108 583 2 11 87
4 737 0 1048576 13 0 0 0 132 0 0 0 0 0 0 275090 695 8 29 63
3 738 0 1048576 491 0 1080 0 439 0 0 0 0 0 0 358404 800 5 35 60
3 738 0 1048576 13 0 0 0 41 0 0 0 0 0 0 512184 741 12 37 51
3 739 0 1048576 76 0 571 0 269 0 0 0 0 0 0 208337 755 7 25 68
3 740 0 1048576 9 0 198 0 117 0 0 0 0 0 0 283185 662 10 18 72
4 739 0 1048576 10 0 2 0 49 0 0 0 0 0 0 248484 684 10 15 75
2 737 0 1048576 203 0 3 0 125 0 0 0 0 0 0 277137 615 8 27 65
2 737 0 1048576 28 0 2 0 2 0 0 0 0 0 0 378153 616 8 32 60
2 739 0 1048576 644 0 4015 0 1149 0 0 0 0 0 0 92687 882 6 17 77
4 739 0 1048576 244 0 1222 0 672 0 0 0 0 0 0 152569 814 10 13 77
1 742 0 1048576 465 0 3632 0 1191 0 0 0 0 0 0 242385 902 11 28 61
2 743 0 1048576 407 0 1572 0 957 0 0 0 0 0 0 157772 531 9 15 76

(4) u386mon's output:

u386mon 2.74/SCO 3.2 - zjyw-38    15:18:49    wht@n4hgf
---- CPU --- tot usr ker brk ---------------------------------------------------
 2 Sec Avg %  30   8  22   0 uuuukkkkkkkkkkk
10 Sec Avg %  32   6  26   0 uuukkkkkkkkkkkkk
20 Sec Avg %  30   5  25   0 uukkkkkkkkkkkk
---- Wait -- tot  io pio swp -- (% of real time) -------------------------------
 2 Sec Avg %  11  11   0   0 iiiii
10 Sec Avg %   9   9   0   0 iiii
20 Sec Avg %   6   6   0   0 iii
---- Sysinfo/Minfo --- (last 2031 msec activity) ------------------------------
bread         2  readch 51780167   pswitch   1555  vfault    381  unmodsw 0
bwrite       54  writch   171667   syscall 190057  demand    381  unmodfl 0
lread    388468  rawch        94   sysread 175629  pfault    289  psoutok 0
lwrite     6884  canch         6   syswrit   3620  cw        189  psinfai 0
phread        0  outch      7420   sysfork      7  steal     100  psinok  0
phwrite       0  msg           0   sysexec      7  frdpgs      0  rsout   0
swapin        0  sema          0                   vfpg        0  rsin    0
swapout       0  maxmem -1080464k  runque       0  sfpg        0
bswapin       0  frmem  -1688772k  runocc       0  vspg        0  pages on
bswapout      0  mem used    20%   swpque       0  sspg        0  swap    0
iget      14417  nswap   524288k   swpocc       0  pnpfault    0  cache 992
namei      1795  frswp   524288k                   wrtfault    0  file    0
dirblk     3423  swp used     0%
---- Sysinfo/Minfo --- (last 2041 msec activity) ------------------------------
bread         0  readch 89470697   pswitch   2339  vfault    208  unmodsw 0
bwrite        0  writch    30924   syscall 293135  demand    206  unmodfl 0
lread    531488  rawch        84   sysread 223028  pfault    238  psoutok 0
lwrite      338  canch         1   syswrit    708  cw         84  psinfai 0
phread        0  outch     10410   sysfork      3  steal     154  psinok  0
phwrite       0  msg           0   sysexec      4  frdpgs      0  rsout   0
swapin        0  sema          0                   vfpg        0  rsin    0
swapout       0  maxmem -1080464k  runque       1  sfpg        0
bswapin       0  frmem  -1680708k  runocc       1  vspg        0  pages on
bswapout      0  mem used    20%   swpque       0  sspg        0  swap    0
iget       2685  nswap   524288k   swpocc       0  pnpfault    0  cache 455
namei       775  frswp   524288k                   wrtfault    0  file    0

(5) part of output of sarcheck:

The following indication(s) of a memory shortage were seen:

The reclaim rate was at least one quarter of the page fault rate in only 0.0 percent of the samples. This statistic can be used to confirm the presence of an occasional memory-poor condition.

The average swap out transfer request rate was 1768.3 per second, which is an indication of a memory-poor condition.

The amount of freeswp did not change during the monitoring period, indicating that the system has plenty of memory installed.

The average number of free pages usually did not stray far above the value of GPGSHI. This indicates that vhand, the page stealing daemon, was usually active and the memory poor condition seen on this system has resulted in increased CPU overhead as well as additional disk activity.
Both GPGSHI and GPGSLO were set to high values, relative to the amount of memory present. Since paging was seen and these parameters are set in a way that increases the activity of the page stealing vhand daemon, consider lowering the values of GPGSHI and GPGSLO.

The difference between GPGSLO and GPGSHI is large. This may create a CPU bottleneck while a large amount of dirty pages are being written to disk.

***********
My questions are:

(1) sarcheck's output: "The following indication(s) of a memory shortage were seen: The reclaim rate was at least one quarter of the page fault rate in only 0.0 percent of the samples. This statistic can be used to confirm the presence of an occasional memory-poor condition."
--> What does this statement mean?

(2) sarcheck's output: "The average swap out transfer request rate was 1768.3 per second, which is an indication of a memory-poor condition."
--> How is the number 1768.3 calculated? According to the sar and vmstat output there seems to be no swapping, so why does sarcheck say "The average swap out transfer request rate was 1768.3 per second" and that there is a "memory-poor condition"?

(3) sarcheck's output: "The average number of free pages usually did not stray far above the value of GPGSHI."
--> GPGSHI's value is 6000, and according to the output of sar -r, freemem (680115) is significantly higher than GPGSHI. Why is sarcheck's conclusion the opposite?

(4) Is the output of sar -p normal? Is vflt or pflt too large?

(5) Is the output of vmstat normal? Is sy or cs too large?

(6) In u386mon's output, steal is not zero. Why? The system's freemem never falls below GPGSLO.

Sorry for so many questions. I'd appreciate anyone's advice and help.

Best regards to all
| |||
yannanqi@126.com wrote:

> system configuration: sco 5.0.6, with about 170 ttys loggedin by
> telnet, two 2G cpu, 4G memory

A buncha other stuff, in ugly format, not worth trying to edit for quoting.

Your sar output looks reasonable for a system as described. So do the other utils (modulo a few display bugs in u386mon). The system isn't swapping at all and has loads more memory than it needs. sarcheck looks like it isn't prepared to deal with some details of the sar outputs -- is it the latest sarcheck for OSR506?

The big thing missing in all that output is your description of what's wrong. It looks like a system that has a lot of work to do and is doing it without complaint. Plus some spurious nonsense from sarcheck. If the whole problem is the advice from sarcheck, ignore it (ask them for advice, though...)

The one possibly questionable stat is that the disk is 100% busy. But you posted a snapshot, we can't tell if that was a momentary burst or continuous. If it's continuous, the system might benefit from a faster disk subsystem (faster drive, faster HBA, maybe an external RAID of the sort that's intended to speed things up rather than or in addition to giving redundancy -- RAID 0 or RAID 10). Although it's 100% busy, the delay stats didn't look bad, so I'm not sure if this relates to your issue.

If there's an actual performance problem, why don't you describe it instead of posting a morass of details that don't seem to show much wrong?

In your other message about NBUF:

> On OSR506 platform with 4G memory, the mtune shows:NBUF
> 0 24 450000,that means the maximum value of NBUF is
> 450000,but if I give 1000000 to NBUF,when system starts,it give the
> following message:
>
> kernel: Hz = 100, i/o bufs = 467116k (high bufs = 466092k)CONFIG:
> Buffer allocation was reduced (NBUF reduced to 467116)
>
> (1)That means NBUF gets a value of 467116, where does this number come
> from?
I would guess that 450000 was someone's back-of-napkin calculation of the most buffers that could be guaranteed to be accommodated within the constraints of other kernel structures. When you demand 1000000 buffers, you cause the kernel to do a live calculation of the same constraints, only now it has more specific information about certain structures whose sizes are system-specific. Some of the constraints on your system aren't quite the theoretical limits, so it can squeeze in a few more buffers.

You should expect that by demanding the absolute maximum buffers, you may be invisibly squeezing down the size of other kernel structures. This could potentially hurt performance or stability. (I'm not saying that it _does_ hurt, I don't really know.) You can also reasonably expect that SCO _tested_ with 450000 buffers but not with 467116. I doubt the 3.8% increase in buffers is making so much difference in performance that it's worth running in an untested configuration.

> ps:
> (2) If NBUF has a value other than zero, Is it ok to let NHBUF=0? Can
> NHBUF self-tune according to NBUF when NBUF is not set to zero?

It should auto-tune. You can observe runtime values of these by doing:

    # crash
    > v | grep buf
    v_buf: 450000
    v_hbuf: 524288

If you boot with different forced NBUF (v_buf) values, you should see v_hbuf (NHBUF) float to different values. It's always a power of 2 so you'll have to make sharp changes to NBUF to see NHBUF change.

> (3)When does MAXBUF have effect, when NBUF is zero or NBUF is not zero?

MAXBUF is an obsolete parameter, no longer edited by configure(ADM), no longer meaningful to the kernel.

>Bela<
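The two numeric points in this reply (the 3.8% figure and the power-of-2 behaviour of NHBUF) can be checked with a short Python sketch. This is only an illustration: the thread does not state the kernel's real auto-tuning rule, so `next_power_of_two` is an assumption that merely agrees with the single observed pair (v_buf 450000, v_hbuf 524288). Both function names are made up for this example.

```python
def pct_increase(old, new):
    """Percentage growth from old to new."""
    return (new - old) / old * 100.0


def next_power_of_two(n):
    """Smallest power of two >= n, for n >= 1.

    ASSUMPTION: used here as a stand-in for NHBUF auto-tuning. The real
    kernel rule is not given in the thread; this rounding only matches
    the one data point shown by crash (450000 -> 524288).
    """
    return 1 << (n - 1).bit_length()


# 467116 buffers instead of the documented maximum of 450000
print(f"NBUF increase: {pct_increase(450000, 467116):.1f}%")

# A small change in NBUF leaves the power of two unchanged,
# while a sharp reduction moves it.
for nbuf in (50000, 450000, 467116):
    print(f"NBUF={nbuf:7d} -> power of two {next_power_of_two(nbuf)}")
```

Under this assumed rule, 450000 and 467116 land on the same power of two, which fits the remark that only sharp changes to NBUF would show up in NHBUF.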
| |||
Bela, I can't express my feelings in words. Only one word: you're great, many thanks! Your clear explanation freed me.

But the sco system really does have a performance problem: the telnet users' working interface is very slow, and the items of the dropdown list fields appear slowly, one by one.

(1) sar -b: The %rcache and %wcache seem to be normal.

SCO_SV zjyw-38 3.2v5.0.6 i80386 04/03/2008
14:25:10 bread/s lread/s %rcache bwrit/s lwrit/s %wcache pread/s pwrit/s (-b)
14:25:11 18 36849 100 388 5282 93 0 0
14:25:12 3 158264 100 33 316 89 0 0
14:25:13 0 59686 100 0 226 100 0 0
14:25:14 0 79142 100 0 159 100 0 0
14:25:15 0 164502 100 0 50 100 0 0
14:25:16 0 169043 100 0 181 100 0 0
14:25:17 0 61087 100 0 7 100 0 0
14:25:18 0 15037 100 0 16 100 0 0
14:25:19 0 230439 100 0 102 100 0 0
14:25:20 0 55642 100 0 61 100 0 0
14:25:21 0 37027 100 0 12 100 0 0
14:25:22 1 127536 100 0 43 100 0 0
14:25:23 0 133101 100 18 122 85 0 0
14:25:24 0 17444 100 1 2 60 0 0
14:25:25 0 0 0 0 0 100 0 0
14:25:26 0 7142 100 1 12 92 0 0
14:25:27 0 3 100 4 4 11 0 0
14:25:28 0 146721 100 0 79 100 0 0
14:25:29 0 8179 100 0 37 100 0 0
14:25:30 0 175348 100 0 37 100 0 0
14:25:31 0 98968 100 0 81 100 0 0
14:25:32 0 67449 100 0 26 100 0 0
14:25:33 0 66537 100 0 27 100 0 0
14:25:34 0 19567 100 0 8 100 0 0
14:25:35 0 99711 100 0 31 100 0 0
14:25:36 0 45507 100 0 86 100 0 0
14:25:37 0 98409 100 0 34 100 0 0
14:25:39 0 85748 100 10 80 88 0 0
14:25:40 130 156812 100 6 5129 100 0 0
14:25:41 0 14653 100 421 143 0 0 0
14:25:42 0 431218 100 0 284 100 0 0
14:25:43 0 26278 100 0 81 100 0 0
14:25:44 0 77340 100 0 116 100 0 0
14:25:45 0 18695 100 0 18 100 0 0
14:25:46 0 21389 100 0 20 100 0 0
14:25:47 0 149728 100 11 68 84 0 0
14:25:48 0 1027 100 0 56 100 0 0

(2) sar -d:

14:24:48 Sdsk-0 4.95 1.00 7.92 15.84 0.00 6.25
14:24:49
14:24:50
14:24:51
14:24:52
14:24:53
14:24:54 Sdsk-0 1.98 1.00 4.95 9.90 0.00 4.00
14:24:55
14:24:56
14:24:57 Sdsk-0 10.00 1.00 13.00 26.00 0.00 7.69
14:24:58 Sdsk-0 100.00 1.00 53.92 178.43 0.00 21.82
14:24:59 Sdsk-0 100.00 1.00 240.59 1976.24 0.00 44.07
14:25:00 Sdsk-0 0.99 1.00 0.99 1.98 0.00 10.00
14:25:01
14:25:02
14:25:03
14:25:04
14:25:05
14:25:06
14:25:07
14:25:08
14:25:09 Sdsk-0 0.99 1.00 0.99 9.90 0.00 10.00
14:25:10
14:25:11 Sdsk-0 100.00 1.00 134.65 857.43 0.00 76.25
14:25:12 Sdsk-0 21.78 1.00 6.93 27.72 0.00 31.43
14:25:13
14:25:14
14:25:15
14:25:16
14:25:17
14:25:18
14:25:19
14:25:20
14:25:21 Sdsk-0 1.00 1.00 1.00 2.00 0.00 10.00
14:25:22
14:25:23 Sdsk-0 17.65 1.00 18.63 37.25 0.00 9.47

(3) vmstat:

Thu Mar 27 16:23:31 CST 2008
# vmstat 1 100

PROCS PAGING SYSTEM CPU
r b w frs dmd sw cch fil pft frp pos pif pis rso rsi sy cs us su id
2 921 0 1048576 8 0 0 0 2 0 0 0 0 0 0 1128585 54570 28 27 45
2 921 0 1048576 62 0 0 0 0 0 0 0 0 0 0 1174108 57445 20 32 48
2 918 0 1048576 5 0 0 0 14 0 0 0 0 0 0 1149028 52759 35 36 29
2 920 0 1048576 26 0 425 0 104 0 0 0 0 0 0 1180505 55345 24 33 43
2 920 0 1048576 0 0 0 0 0 0 0 0 0 0 0 1204870 57131 29 26 45
5 917 0 1048576 28 0 0 0 1 0 0 0 0 0 0 1154900 53069 30 30 40
2 922 0 1048576 31 0 425 0 104 0 0 0 0 0 0 1232803 55362 27 35 38
3 921 0 1048576 30 0 0 0 0 0 0 0 0 0 0 1153011 47926 27 37 36
3 925 0 1048576 111 0 822 0 165 0 0 0 0 0 0 1165042 55808 34 39 27
2 926 0 1048576 24 0 85 0 18 0 0 0 0 0 0 1203487 46580 32 43 25
4 925 0 1048576 6 0 85 0 15 0 0 0 0 0 0 1215516 55999 23 35 42

*************************
My thoughts:

(1) In vmstat's output, "sy" (system calls) and "cs" (context switches) are very high. Are these values normal? (The sco system is only an endpoint telnet server, and the telnet users don't do much work, only querying and charging; the database server is on a third-party machine.)

(2) In vmstat's output, does "cch" affect the system's performance?
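To put numbers on the "are sy and cs too high?" question, the vmstat samples above can be averaged with a short Python sketch. It is an illustration only: the column names come from the vmstat header line in the post, `column_means` is a hypothetical helper, and no claim is made about what values count as normal on OSR506.

```python
from statistics import mean

# Column names copied from the vmstat header line in the post.
HEADER = "r b w frs dmd sw cch fil pft frp pos pif pis rso rsi sy cs us su id".split()

# The eleven vmstat samples posted above.
SAMPLES = """\
2 921 0 1048576 8 0 0 0 2 0 0 0 0 0 0 1128585 54570 28 27 45
2 921 0 1048576 62 0 0 0 0 0 0 0 0 0 0 1174108 57445 20 32 48
2 918 0 1048576 5 0 0 0 14 0 0 0 0 0 0 1149028 52759 35 36 29
2 920 0 1048576 26 0 425 0 104 0 0 0 0 0 0 1180505 55345 24 33 43
2 920 0 1048576 0 0 0 0 0 0 0 0 0 0 0 1204870 57131 29 26 45
5 917 0 1048576 28 0 0 0 1 0 0 0 0 0 0 1154900 53069 30 30 40
2 922 0 1048576 31 0 425 0 104 0 0 0 0 0 0 1232803 55362 27 35 38
3 921 0 1048576 30 0 0 0 0 0 0 0 0 0 0 1153011 47926 27 37 36
3 925 0 1048576 111 0 822 0 165 0 0 0 0 0 0 1165042 55808 34 39 27
2 926 0 1048576 24 0 85 0 18 0 0 0 0 0 0 1203487 46580 32 43 25
4 925 0 1048576 6 0 85 0 15 0 0 0 0 0 0 1215516 55999 23 35 42
"""


def column_means(text, names):
    """Average the named vmstat columns over all sample rows in text."""
    rows = [
        dict(zip(HEADER, map(int, line.split())))
        for line in text.splitlines()
        if line.strip()
    ]
    return {name: mean(row[name] for row in rows) for name in names}


means = column_means(SAMPLES, ["sy", "cs", "us", "su", "id"])
print(f"mean idle:        {means['id']:.1f}%")
print(f"mean user/system: {means['us']:.1f}% / {means['su']:.1f}%")
print(f"syscalls per context switch: {means['sy'] / means['cs']:.1f}")
```

The mean of the `id` column works out to 38, so despite the large `sy` and `cs` counts the CPUs are idle for well over a third of the sampled time.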
| |||
On Wed, Apr 02, 2008, yannanqi@126.com wrote:

>Bela,I can't express my heart by words.Only one word:you're
>great,great thanks! You freeed me through clear explanation.
>
>But the sco system really encounters performance problem: the telnet
>users' working interface is very slow,the items of the dropdown list
>fields slowly appears one by one.

Does the system exhibit this type of performance on the console? If it doesn't, the problem is most likely network related.

If it is a network problem, it could be a bad NIC, network switch, hub, or even another machine on the LAN with the same IP address as the server. DNS problems usually show up with long initial connection times as the system attempts to resolve the host name of the connecting IP.

I have seen major problems with NICs which show high numbers of errors on incoming and outgoing packets. On Linux systems the ifconfig command shows the error history, but SCO's doesn't, at least not on the OSR 5.0.6a systems we have here.

Bill
--
INTERNET: bill@celestial.com  Bill Campbell; Celestial Software LLC
URL: http://www.celestial.com/  PO Box 820; 6641 E. Mercer Way
FAX: (206) 232-9186  Mercer Island, WA 98040-0820; (206) 236-1676

That rifle on the wall of the labourer's cottage or working class flat is the symbol of democracy. It is our job to see that it stays there. --GEORGE ORWELL
| |||
yannanqi@126.com wrote:

> But the sco system really encounters performance problem: the telnet
> users' working interface is very slow,the items of the dropdown list
> fields slowly appears one by one.
>
> (1)sar -b: The %rcache and %wcache seem to be normal.

They're actually exceptionally high (since you have so much buffer cache). Shouldn't be causing a performance problem.

> (2) sar -d:
> 14:24:48 Sdsk-0 4.95 1.00 7.92 15.84 0.00 6.25
> 14:24:49
> 14:24:50
> 14:24:51
> 14:24:52
> 14:24:53
> 14:24:54 Sdsk-0 1.98 1.00 4.95 9.90 0.00 4.00
> 14:24:55
> 14:24:56
> 14:24:57 Sdsk-0 10.00 1.00 13.00 26.00 0.00 7.69
> 14:24:58 Sdsk-0 100.00 1.00 53.92 178.43 0.00 21.82
> 14:24:59 Sdsk-0 100.00 1.00 240.59 1976.24 0.00 44.07
> 14:25:00 Sdsk-0 0.99 1.00 0.99 1.98 0.00 10.00
> 14:25:01
> 14:25:02

Alternating 100% busy and idle, hmmm.

How full are the filesystems? HTFS on OSR506 (and OSR507 without at least MP3 or so) was extremely inefficient at allocating space on nearly-full filesystems. On large filesystems (100GiB would be large enough), this inefficiency was costly in both CPU and disk I/O terms. But your buffer cache stats suggest this is not the problem.

Much more likely: you've got dirty buffer cache storms. You've given the system 450MB of buffer cache. A process that was writing very quickly to an already allocated file could dirty tens of megabytes in a few seconds. Those blocks would stay in cache until bdflush was run, then they would all try to write to disk at the same time, busying out the disk for a long time.

To mitigate this, change BDFLUSHR to 1 (run bdflush as often as possible, once a second) and NAUTOUP to 2 (flush buffers that are no more than 2 seconds old). This costs a bit of extra CPU, but your system has plenty to spare.

I seem to remember that one of the OSR507 patches also improved some buffer cache handling. With your 506, the system might actually run _faster_ with a much smaller buffer cache. You should test it with a sharp reduction, e.g. NBUF=50000; revert back to 450000 if it doesn't help.

Because of the buffer cache & filesystem space allocation improvements, this system would probably be a lot happier under OSR507 + MP5. (Or it might make no difference... can't really tell without trying.)

> (3)vmstat:
> Thu Mar 27 16:23:31 CST 2008
> # vmstat 1 100
>
> PROCS PAGING SYSTEM CPU
> r b w frs dmd sw cch fil pft frp pos pif pis rso rsi sy cs us su id
>
> 2 921 0 1048576 8 0 0 0 2 0 0 0 0 0 0 1128585 54570 28 27 45
> 2 921 0 1048576 62 0 0 0 0 0 0 0 0 0 0 1174108 57445 20 32 48
> 2 918 0 1048576 5 0 0 0 14 0 0 0 0 0 0 1149028 52759 35 36 29
> 2 920 0 1048576 26 0 425 0 104 0 0 0 0 0 0 1180505 55345 24 33 43
> 2 920 0 1048576 0 0 0 0 0 0 0 0 0 0 0 1204870 57131 29 26 45
> 5 917 0 1048576 28 0 0 0 1 0 0 0 0 0 0 1154900 53069 30 30 40
> 2 922 0 1048576 31 0 425 0 104 0 0 0 0 0 0 1232803 55362 27 35 38
> 3 921 0 1048576 30 0 0 0 0 0 0 0 0 0 0 1153011 47926 27 37 36
> 3 925 0 1048576 111 0 822 0 165 0 0 0 0 0 0 1165042 55808 34 39 27
> 2 926 0 1048576 24 0 85 0 18 0 0 0 0 0 0 1203487 46580 32 43 25
> 4 925 0 1048576 6 0 85 0 15 0 0 0 0 0 0 1215516 55999 23 35 42
>
> *************************
> My thoughts:
> (1)By vmstat's output,the "sy:system calls" and "cs:context switch"
> are very high. Are these value normal? (The sco system is only an
> endpoint-telnet-server, and the telnet users don't have much business,
> only querying & charging--the database server is on third part)
> (2)By vmstat's output,does the "cch" effect the system's performance?

Syscalls/sec does seem high, but since the CPU is still 40% idle, that's not the problem. Context switches/sec is in line with syscalls.

I think cch "pages in cache" refers to pages that have been marked for possible purging by the virtual memory sweeper, then were demonstrated (by a page fault) to still be in use. This is part of the normal functioning of the virtual memory system and the page rate looks reasonable, maybe even a bit low (not a worry).

>Bela<
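The "dirty buffer cache storm" reasoning can be made concrete with a deliberately simplified Python model. Everything here is an assumption for illustration: bdflush is modelled as waking every BDFLUSHR seconds and writing out every buffer that has been dirty for at least NAUTOUP seconds, the 5 MB/s write rate is invented, and the 30/10 pair is just an example of a less aggressive setting. The function names are hypothetical and the real kernel behaviour may differ.

```python
def flush_burst_mb(write_rate_mb_s, bdflushr_s, cache_mb=450):
    """Rough amount of data handed to the disk in one bdflush pass.

    ASSUMPTION: under sustained writing, each pass picks up whatever
    became old enough since the previous pass, i.e. about BDFLUSHR
    seconds of writes, capped by the buffer cache size (450 MB is the
    figure mentioned in the reply above).
    """
    return min(write_rate_mb_s * bdflushr_s, cache_mb)


def worst_case_dirty_age_s(bdflushr_s, nautoup_s):
    """Longest a dirty buffer may wait before being written (simplified).

    A buffer that turns NAUTOUP seconds old just after a pass has to
    wait for the next pass, one BDFLUSHR interval later.
    """
    return nautoup_s + bdflushr_s


WRITE_RATE_MB_S = 5  # hypothetical sustained write rate

for bdflushr, nautoup in ((30, 10), (1, 2)):
    burst = flush_burst_mb(WRITE_RATE_MB_S, bdflushr)
    age = worst_case_dirty_age_s(bdflushr, nautoup)
    print(f"BDFLUSHR={bdflushr:2d} NAUTOUP={nautoup:2d}: "
          f"~{burst} MB per pass, dirty data up to ~{age} s old")
```

In this model the suggested settings (BDFLUSHR=1, NAUTOUP=2) turn one large periodic burst into many small ones, which is the smoothing effect the advice is aiming for. Whether it matters depends on how much the workload actually writes.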
| |||
Bill Campbell wrote:

> I have seen major problems with NICs which show high numbers of
> errors on incoming and outgoing packets. On Linux systems the
> ifconfig command shows the error history, but SCO's doesn't, at
> least not on the OSR 5.0.6a systems we have here.

netstat -i; ndstat -l

>Bela<
| |||
Sorry for my absence these past days; I was on a national holiday. Your rich knowledge and patience overwhelm me... Thanks again.

(1) Bela Lubkin wrote:
> Bill Campbell wrote:
>
>> Does the system exhibit this type of performance on the console?
>> If it doesn't, the problem is most likely network related.
>
> > I have seen major problems with NICs which show high numbers of
> > errors on incoming and outgoing packets. On Linux systems the
> > ifconfig command shows the error history, but SCO's doesn't, at
> > least not on the OSR 5.0.6a systems we have here.
>
> netstat -i; ndstat -l
>
> >Bela<

Because the application needs an operator ID and password to log in, I can't test it on the console right now. But I'll try to do it later and report the result to you. The NIC should be ok, because the old application works fine. The following is the result of "netstat & ndstat":

# netstat -i
Name  Mtu  Network  Address   Ipkts   Ierrs Opkts   Oerrs Coll
net1  1500 142.70   zjyw-38   8460437 0     7113058 0     469448
lo0   8232 loopback localhost 2467255 0     2467255 0     0
atl0* 8232 none     none      No Statistics Available

# ndstat
Device     MAC address in use  Factory MAC Address
------     ------------------  -------------------
/dev/net1  00:15:60:a5:ac:80   00:15:60:a5:ac:80

Multicast address table
-----------------------
01:00:5e:00:00:01

FRAMES
      Unicast     Multicast  Broadcast  Error   Octets       Queue Length
      ----------  ---------  ---------  ------  -----------  ------------
In:   7943453     0          517623     0       613619190    0
Out:  7113607     0          1          0       500415540    0

# ndstat -l
Device     MAC address in use  Factory MAC Address
------     ------------------  -------------------
/dev/net1  00:15:60:a5:ac:80   00:15:60:a5:ac:80

Multicast address table
-----------------------
01:00:5e:00:00:01

FRAMES
      Unicast     Multicast  Broadcast  Error   Octets       Queue Length
      ----------  ---------  ---------  ------  -----------  ------------
In:   7943990     0          517715     0       613689653    0
Out:  7114102     0          1          0       500466929    0

DLPI Module Info: 2 SAPs open, 18 SAPs maximum
                  5281 frames received destined for an unbound SAP

MAC Driver Info: Media_type: Ethernet
                 Min_SDU: 14, Max_SDU: 1514, Address length: 6
                 Interface speed: 10 Mbits/sec

DLPI Restarts Info: Last queue size: 0
                    Last send time: 6080505
                    Restart in progress: 0
                    Number of restarts: 0

Interface Version: MDI 100

ETHERNET SPECIFIC STATISTICS

Collision Table - The number of frames successfully transmitted, but involved in at least one collision:

               Frames                  Frames
               -------                 -------
 1 collision   229269   9 collisions   125
 2 collisions  55519   10 collisions   19
 3 collisions  15181   11 collisions   2
 4 collisions  9432    12 collisions   0
 5 collisions  6231    13 collisions   0
 6 collisions  1748    14 collisions   0
 7 collisions  248     15 collisions   0
 8 collisions  151     16 collisions   0

Bad Alignment                 0       Number of frames received that were not an integral number of octets
FCS Errors                    0       Number of frames received that did not pass the Frame Check Sequence
SQE Test Errors               0       Number of Signal Quality Error Test signals that were detected by the adapter
Deferred Transmissions        118929  Number of frames delayed on the first transmission attempt because the media was busy
Late Collisions               0       Number of times a collision was detected later than 512 bits into the transmitted frame
Excessive Collisions          0       Number of frames dropped on transmission because of excessive collisions
Internal MAC Transmit Errors  0       Number of frames dropped on transmission because of errors not covered above
Carrier Sense Errors          0       Number of times that the carrier sense condition was lost when attempting to send a frame that was deferred for an excessive amount of time
Frame Too Long                0       Number of frames dropped on reception because they were larger than the maximum Ethernet frame size
Internal MAC Receive Errors   0       Number of frames dropped on reception because of errors not covered above
Spurious Interrupts           0       Number of times the adapter interrupted the system for an unknown reason
No STREAMS Buffers            0       Number of frames dropped on reception because no STREAMS buffers were available
Underruns/Overruns            0       Number of times the transfer of data to or from the frame buffer did not complete successfully
Device Timeouts               0       Number of times the adapter failed to respond to a request from the driver
#

(2) The filesystems are mostly free, so this shouldn't be the problem:

# dfspace
/        : Disk space: 7434.21 MB of 8927.00 MB available (83.28%).
/stand   : Disk space: 2.41 MB of 14.99 MB available (16.12%).
/serv    : Disk space: 27516.99 MB of 29998.61 MB available (91.73%).
/servbak : Disk space: 28590.46 MB of 29999.01 MB available (95.30%).

(3)
> Much more likely: you've got dirty buffer cache storms. You've given
> the system 450MB of buffer cache. A process that was writing very
> quickly to an already allocated file could dirty tens of megabytes in a
> few seconds. Those blocks would stay in cache until bdflush was run,
> then they would all try to write to disk at the same time, busying out
> the disk for a long time.
>
> To mitigate this, change BDFLUSHR to 1 (run bdflush as often as
> possible, once a second) and NAUTOUP to 2 (flush buffers that are no
> more than 2 seconds old). This costs a bit of extra CPU, but your
> system has plenty to spare.

This shouldn't be the cause, because the application generates few writes, about 20M in total per day.

(4)
> I seem to remember that one of the OSR507 patches also improved some
> buffer cache handling. With your 506, the system might actually run
> _faster_ with a much smaller buffer cache. You should test it with a
> sharp reduction, e.g. NBUF=50000; revert back to 450000 if it doesn't
> help.

I'll test it and report the result as soon as possible.

(5)
> Because of the buffer cache & filesystem space allocation improvements,
> this system would probably be a lot happier under OSR507 + MP5. (Or it
> might make no difference... can't really tell without trying.)

I have no choice here... The software supplier says their application only supports OSR506 and OSR505.
| |||
yannanqi@126.com wrote:

> The NIC should be ok, because the old application
> works fine. The following is the result of "netstat & ndstat":
>
> # netstat -i
> Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll
> net1 1500 142.70 zjyw-38 8460437 0 7113058 0 469448
> lo0 8232 loopback localhost 2467255 0 2467255 0 0
> atl0* 8232 none none No Statistics Available

> # ndstat
> MAC Driver Info: Media_type: Ethernet
> Min_SDU: 14, Max_SDU: 1514, Address length: 6
> Interface speed: 10 Mbits/sec

> Frames Frames
> ------- -------
> 1 collision 229269 9 collisions 125
> 2 collisions 55519 10 collisions 19
> 3 collisions 15181 11 collisions 2
> 4 collisions 9432 12 collisions 0
> 5 collisions 6231 13 collisions 0
> 6 collisions 1748 14 collisions 0
> 7 collisions 248 15 collisions 0
> 8 collisions 151 16 collisions 0

> Deferred Transmissions 118929 Number of frames delayed on the
> first transmission attempt because
> the media was busy

6.5% collisions on output seems pretty high. For comparison, this system I'm looking at has sent 33 million packets, experiencing 0 collisions and 133 deferred transmissions. Of course it's the big fish on a pretty quiet LAN, and it's probably on a switch.

High collisions can be a sign of: very busy network; bad cables; incorrect autodetection of duplex. You should put this system on a 100Mbps or 1Gbps network, preferably on a switch, and make sure it is set for or autodetecting the right duplex setting.

You said the problem was telnet users having slow response. Interactive use exchanges one or more packets for every character typed by the user. With 6.5% collisions, every sentence they type is going to experience several collisions and the resulting back-off algorithm. I can imagine this causing the entire problem.

> (4)
> > I seem to remember that one of the OSR507 patches also improved some
> > buffer cache handling. With your 506, the system might actually run
> > _faster_ with a much smaller buffer cache. You should test it with a
> > sharp reduction, e.g. NBUF=50000; revert back to 450000 if it doesn't
> > help.
>
> I'll test it and feedback the result as soon as possible.

Ok. Even with the net issues, I'm still suspicious about the 100% busy disk readings. Your buffer cache ratios are very high, disk shouldn't need to be busy. Is it a very old & slow disk? Swap in a fast disk.

> (5)
> > Because of the buffer cache & filesystem space allocation improvements,
> > this system would probably be a lot happier under OSR507 + MP5. (Or it
> > might make no difference... can't really tell without trying.)
>
> I have no method... Because the software supplier says their
> application only support OSR506 and OSR505.

What they mean is they can't be bothered to test with anything newer. It would probably be fine, backwards compatibility was/is SCO's core competency...

>Bela<
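The collision percentage can be recomputed from the `netstat -i` counters for net1 (Opkts 7113058, Coll 469448) with a short Python sketch. The second half is only an illustration of the "several collisions per sentence" remark: the 60-character sentence and the one-output-packet-per-character model are assumptions, not measurements.

```python
# Counters copied from the netstat -i line for net1 in the thread.
OPKTS = 7113058   # output packets
COLL = 469448     # collisions

collision_rate = COLL / OPKTS
print(f"collisions per output packet: {collision_rate * 100:.1f}%")

# ASSUMPTION: a 60-character sentence, with each typed character
# producing at least one output packet (for example, its echo).
CHARS_PER_SENTENCE = 60
expected_collisions = CHARS_PER_SENTENCE * collision_rate
print(f"expected collisions while typing one sentence: ~{expected_collisions:.1f}")
```

The ratio comes out at about 6.6%, in line with the roughly 6.5% quoted above, and under the stated assumptions a single typed sentence would run into about four collisions.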
| |||
It could be network or disk or both! Having Bela in here is truly awesome. I am not in Bela's league of understanding the inner workings of SCO, but I have some practical generic advice for you to consider.

The buffer cache flush daemon "bdflush" will be flushing regularly, and when it does, it is writing your (huge) buffer cache to disk. This could be responsible for the surges in disk i/o that you see. Making your buffer cache smaller, or flushing more frequently, or a combination of both, could help to smooth out the big data write tsunamis into smaller waves, but looking at the underlying disk and/or RAID architecture is also important.

If the activity that is generating i/o on your system is very bursty and infrequent, then a bigger cache could help deal with a slow disk, but if the activity is frequent or continuous then you really need a faster disk. The frequency of i/o bursts and the timing of the bdflush is also important, but faster disks always help.

You have a lot of collisions on your network - you need to deal with that too. That could be a range of issues, but work through a process of elimination looking at things like the following:

a. Switch configuration (assuming you have managed switches with some layer 3 capabilities) - I have found that turning IGMP snooping ON will assist in managing broadcast traffic.

b. Check the event logs on the switch, looking at error counts by port, and follow the trail to track down the source of the noise where the counts are highest.

c. Beef up the server to switch connection - make sure the data pipes are fat where data converges! Upgrade the server NIC to gigabit and plug it into a gigabit port on your switch. Make sure the backbone of your network linking your LAN segments together has fat pipes too.

d. Look at implementing some QoS for your telnet traffic if all of the above are fine.
| ||||
Thanks to Bela and James for the warm-hearted and constructive advice. I'll look into the suggestions one by one.

About the disk's performance, ohh... The server is an HP DL380 G4, RAID 1, with BDFLUSHR=30 and NAUTOUP=10. According to my inspection, the 100% busy doesn't seem to be caused by bdflush, but I'm not sure. Maybe it really is a hardware bottleneck...

No matter what, I'll take your advice to heart and try some changes, then I'll report the results to you, but this may take some days because this is a production environment.

Thanks to Bela, James and Bill again.

Best regards to you.