Unix Technical Forum

Help analyzing a slow system (with sar/vmstat/u386mon/sarcheck data)

  #1 (permalink)  
Old 04-03-2008, 03:45 PM
yannanqi@126.com
 

System configuration: SCO 5.0.6, with about 170 ttys logged in by
telnet, two 2 GHz CPUs, 4 GB memory

(1) output of sar -A:
SCO_SV zjyw-38 3.2v5.0.6 i80386 03/31/2008

09:01:04 %usr %sys %wio %idle (-u)
bread/s lread/s %rcache bwrit/s lwrit/s %wcache pread/s pwrit/s (-b)
device %busy avque r+w/s blks/s avwait avserv (-d)
c_hits cmisses (hit %) (-n)
rawch/s canch/s outch/s rcvin/s xmtin/s mdmin/s (-y)
scall/s sread/s swrit/s fork/s exec/s rchar/s wchar/s (-c)
swpin/s bswin/s swpot/s bswot/s pswch/s (-w)
iget/s namei/s dirbk/s (-a)
runq-sz %runocc swpq-sz %swpocc (-q)
proc-sz ov inod-sz ov file-sz ov lock-sz (-v)
msg/s sema/s (-m)
vflt/s pflt/s pgfil/s rclm/s (-p)
freemem freeswp availrmem availsmem (-r)
cpybuf/s slpcpybuf/s (-B)
dptch/s idler/s swidle/s (-R)
ovsiohw/s ovsiodma/s ovclist/s (-g)
mpbuf/s ompb/s mphbuf/s omphbuf/s pbuf/s spbuf/s dmabuf/s sdmabuf/s (-h)

Average 6 20 2 72 (-u)
Average 6 193153 100 68 1071 94 0 0 (-b)
Average Sdsk-0 100.00 1.00 22.86 148.17 0.00 57.19 (-d)
Average 453343 8647 (98%) (-n)
Average 25 1 5553 0 0 0 (-y)
Average 241413 175778 7502 3.37 3.48 1171261 72597 (-c)
Average 0.00 0.0 0.00 0.0 951 (-w)
Average 7614 990 1768 (-a)
Average 2.1 100 (-q)
Average 0.00 0.00 (-m)
Average 76.83 158.98 0.05 0.00 (-p)
Average 611232 1048576 799826 513961 (-r)
Average 0.00 0.00 (-B)
Average 2707.10 376.22 45.88 (-R)
Average 0.00 0.00 0.00 (-g)
Average 0.04 0.00 16.64 0.00 0.00 0.00 0.00 0.00 (-h)

(2) output of sar:
# sar -r 1 10

SCO_SV zjyw-38 3.2v5.0.6 i80386 03/31/2008

08:46:41 freemem freeswp availrmem availsmem (-r)
08:46:42 680115 1048576 802064 648186
08:46:43 680048 1048576 802062 648143
08:46:44 679997 1048576 802062 648143

# sar -w 1 10

SCO_SV zjyw-38 3.2v5.0.6 i80386 04/02/2008

11:48:43 swpin/s bswin/s swpot/s bswot/s pswch/s (-w)
11:48:44 0.00 0.0 0.00 0.0 1383
11:48:45 0.00 0.0 0.00 0.0 1321
11:48:46 0.00 0.0 0.00 0.0 1417
11:48:47 0.00 0.0 0.00 0.0 1247

sar -p:
08:29:21 vflt/s pflt/s pgfil/s rclm/s (-p)
08:29:22 541.18 1567.65 0.00 0.00
08:29:23 197.06 109.80 0.00 0.00
08:29:24 44.55 41.58 0.00 0.00
08:29:25 50.98 134.31 0.00 0.00
08:29:26 85.15 388.12 0.00 0.00
08:29:27 111.76 358.82 0.00 0.00
08:29:28 534.31 726.47 0.00 0.00
08:29:29 216.67 131.37 0.00 0.00
08:29:30 290.10 550.50 0.00 0.00
08:29:31 244.12 138.24 0.00 0.00
08:29:32 33.98 113.59 0.00 0.00
08:29:33 103.96 279.21 0.00 0.00

(3) output of vmstat:
PROCS PAGING SYSTEM CPU
r b w frs dmd sw cch fil pft frp pos pif pis rso rsi sy cs us su id
1 743 0 1048576 382 0 1222 0 783 0 0 0 0 0 0 170730 547 11 14 75
1 743 0 1048576 0 0 0 0 65 0 0 0 0 0 0 123108 583 2 11 87
4 737 0 1048576 13 0 0 0 132 0 0 0 0 0 0 275090 695 8 29 63
3 738 0 1048576 491 0 1080 0 439 0 0 0 0 0 0 358404 800 5 35 60
3 738 0 1048576 13 0 0 0 41 0 0 0 0 0 0 512184 741 12 37 51
3 739 0 1048576 76 0 571 0 269 0 0 0 0 0 0 208337 755 7 25 68
3 740 0 1048576 9 0 198 0 117 0 0 0 0 0 0 283185 662 10 18 72
4 739 0 1048576 10 0 2 0 49 0 0 0 0 0 0 248484 684 10 15 75
2 737 0 1048576 203 0 3 0 125 0 0 0 0 0 0 277137 615 8 27 65
2 737 0 1048576 28 0 2 0 2 0 0 0 0 0 0 378153 616 8 32 60
2 739 0 1048576 644 0 4015 0 1149 0 0 0 0 0 0 92687 882 6 17 77
4 739 0 1048576 244 0 1222 0 672 0 0 0 0 0 0 152569 814 10 13 77
1 742 0 1048576 465 0 3632 0 1191 0 0 0 0 0 0 242385 902 11 28 61
2 743 0 1048576 407 0 1572 0 957 0 0 0 0 0 0 157772 531 9 15 76

(4)u386mon's output:


u386mon 2.74/SCO 3.2 - zjyw-38 15:18:49 wht@n4hgf
---- CPU --- tot usr ker brk ---------------------------------------------------
2 Sec Avg % 30 8 22 0 uuuukkkkkkkkkkk
10 Sec Avg % 32 6 26 0 uuukkkkkkkkkkkkk
20 Sec Avg % 30 5 25 0 uukkkkkkkkkkkk
---- Wait -- tot io pio swp -- (% of real time) -------------------------------
2 Sec Avg % 11 11 0 0 iiiii
10 Sec Avg % 9 9 0 0 iiii
20 Sec Avg % 6 6 0 0 iii
---- Sysinfo/Minfo --- (last 2031 msec activity) ------------------------------
bread 2 readch 51780167 pswitch 1555 vfault 381 unmodsw 0
bwrite 54 writch 171667 syscall 190057 demand 381 unmodfl 0
lread 388468 rawch 94 sysread 175629 pfault 289 psoutok 0
lwrite 6884 canch 6 syswrit 3620 cw 189 psinfai 0
phread 0 outch 7420 sysfork 7 steal 100 psinok 0
phwrite 0 msg 0 sysexec 7 frdpgs 0 rsout 0
swapin 0 sema 0 vfpg 0 rsin 0
swapout 0 maxmem -1080464k runque 0 sfpg 0
bswapin 0 frmem -1688772k runocc 0 vspg 0 pages on
bswapout 0 mem used 20% swpque 0 sspg 0 swap 0
iget 14417 nswap 524288k swpocc 0 pnpfault 0 cache 992
namei 1795 frswp 524288k wrtfault 0 file 0
dirblk 3423 swp used 0%



---- Sysinfo/Minfo --- (last 2041 msec activity) ------------------------------
bread 0 readch 89470697 pswitch 2339 vfault 208 unmodsw 0
bwrite 0 writch 30924 syscall 293135 demand 206 unmodfl 0
lread 531488 rawch 84 sysread 223028 pfault 238 psoutok 0
lwrite 338 canch 1 syswrit 708 cw 84 psinfai 0
phread 0 outch 10410 sysfork 3 steal 154 psinok 0
phwrite 0 msg 0 sysexec 4 frdpgs 0 rsout 0
swapin 0 sema 0 vfpg 0 rsin 0
swapout 0 maxmem -1080464k runque 1 sfpg 0
bswapin 0 frmem -1680708k runocc 1 vspg 0 pages on
bswapout 0 mem used 20% swpque 0 sspg 0 swap 0
iget 2685 nswap 524288k swpocc 0 pnpfault 0 cache 455
namei 775 frswp 524288k wrtfault 0 file 0

(5) part of output of sarcheck:
The following indication(s) of a memory shortage were seen: The reclaim
rate was at least one quarter of the page fault rate in only 0.0 percent
of the samples. This statistic can be used to confirm the presence of
an occasional memory-poor condition.

The average swap out transfer request rate was 1768.3 per second, which
is an indication of a memory-poor condition.

The amount of freeswp did not change during the monitoring period,
indicating that the system has plenty of memory installed.

The average number of free pages usually did not stray far above the
value of GPGSHI. This indicates that vhand, the page stealing daemon,
was usually active and the memory poor condition seen on this system has
resulted in increased CPU overhead as well as additional disk activity.

Both GPGSHI and GPGSLO were set to high values, relative to the amount
of memory present. Since paging was seen and these parameters are set
in a way that increases the activity of the page stealing vhand daemon,
consider lowering the values of GPGSHI and GPGSLO. The difference
between GPGSLO and GPGSHI is large. This may create a CPU bottleneck
while a large amount of dirty pages are being written to disk.

***********
My questions are:
(1) sarcheck's output: "The following indication(s) of a memory
shortage were seen: The reclaim rate was at least one quarter of the
page fault rate in only 0.0 percent of the samples. This statistic can
be used to confirm the presence of an occasional memory-poor condition."
--> What does this statement mean?
(2) sarcheck's output: "The average swap out transfer request rate was
1768.3 per second, which is an indication of a memory-poor condition."
--> How is the number 1768.3 calculated? According to the sar and
vmstat output there seems to be no swapping, so why does sarcheck say
"The average swap out transfer request rate was 1768.3 per second" and
report a memory-poor condition?
(3) sarcheck's output: "The average number of free pages usually did
not stray far above the value of GPGSHI."
--> GPGSHI is 6000, and according to the output of sar -r, freemem
(680115) is significantly higher than GPGSHI. Why does sarcheck reach
the opposite conclusion?
(4) Is the output of sar -p normal? Is vflt or pflt too large?
(5) Is the output of vmstat normal? Is sy or cs too large?
(6) In u386mon's output, steal is not zero. Why? The system's freemem
never falls below GPGSLO.

Sorry for so many questions; I would appreciate anyone's advice and
help.
Best regards to all
  #2 (permalink)  
Old 04-03-2008, 03:45 PM
Bela Lubkin
 

yannanqi@126.com wrote:

> system configuration: sco 5.0.6, with about 170 ttys loggedin by
> telnet, two 2G cpu, 4G memory


A buncha other stuff, in ugly format, not worth trying to edit for
quoting.

Your sar output looks reasonable for a system as described. So do the
other utils (modulo a few display bugs in u386mon). The system isn't
swapping at all and has loads more memory than it needs. sarcheck looks
like it isn't prepared to deal with some details of the sar outputs --
is it the latest sarcheck for OSR506?
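
To put the `sar -r` figures from the first post into familiar units, here is a minimal sketch. It assumes `freemem` is reported in 4 KB pages and `freeswp` in 512-byte disk blocks; neither unit is stated in the thread, although the swap figure then agrees with the 524288k shown by u386mon. The GPGSHI value of 6000 is the one quoted in the original question.

```python
# Convert one sar -r sample into megabytes.
# Assumptions (not stated in the thread): freemem is in 4 KB pages,
# freeswp is in 512-byte disk blocks.
PAGE_BYTES = 4096
BLOCK_BYTES = 512

freemem_pages = 680115     # sar -r sample at 08:46:42
freeswp_blocks = 1048576   # same sample
gpgshi_pages = 6000        # value quoted in the original question

freemem_mb = freemem_pages * PAGE_BYTES / 2**20
freeswp_mb = freeswp_blocks * BLOCK_BYTES / 2**20
headroom = freemem_pages / gpgshi_pages

print(f"free memory: {freemem_mb:.0f} MB")
print(f"free swap:   {freeswp_mb:.0f} MB")
print(f"freemem is {headroom:.0f}x GPGSHI")
```

Under those assumptions, free memory sits more than a hundred times above GPGSHI and swap is entirely unused, which fits the reading that the system is not short of memory.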

The big thing missing in all that output is your description of what's
wrong. It looks like a system that has a lot of work to do and is doing
it without complaint. Plus some spurious nonsense from sarcheck. If
the whole problem is the advice from sarcheck, ignore it (ask them for
advice, though...)

The one possibly questionable stat is that the disk is 100% busy. But
you posted a snapshot, we can't tell if that was a momentary burst or
continuous. If it's continuous, the system might benefit from a faster
disk subsystem (faster drive, faster HBA, maybe an external RAID of the
sort that's intended to speed things up rather than or in addition to
giving redundancy -- RAID 0 or RAID 10). Although it's 100% busy, the
delay stats didn't look bad, so I'm not sure if this relates to your
issue.

If there's an actual performance problem, why don't you describe it
instead of posting a morass of details that don't seem to show much
wrong?

In your other message about NBUF:

> On OSR506 platform with 4G memory, the mtune shows:NBUF
> 0 24 450000,that means the maximum value of NBUF is
> 450000,but if I give 1000000 to NBUF,when system starts,it give the
> following message:
>
> kernel: Hz = 100, i/o bufs = 467116k (high bufs = 466092k)CONFIG:
> Buffer allocation was reduced (NBUF reduced to 467116)
>
> (1)That means NBUF gets a value of 467116, where does this number come
> from?


I would guess that 450000 was someone's back-of-napkin calculation
of the most buffers that could be guaranteed to be accommodated within
the constraints of other kernel structures. When you demand 1000000
buffers, you cause the kernel to do a live calculation of the same
constraints, only now it has more specific information about certain
structures whose sizes are system-specific. Some of the constraints on
your system aren't quite the theoretical limits, so it can squeeze in a
few more buffers.

You should expect that by demanding the absolute maximum buffers, you
may be invisibly squeezing down the size of other kernel structures.
This could potentially hurt performance or stability. (I'm not saying
that it _does_ hurt, I don't really know.) You can also reasonably
expect that SCO _tested_ with 450000 buffers but not with 467116. I
doubt the 3.8% increase in buffers is making so much difference in
performance that it's worth running in an untested configuration.
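
The 3.8% figure follows directly from the two numbers in the quoted boot message and mtune output. A short worked check (the variable names are only illustrative):

```python
# Relative increase from the documented NBUF maximum to the value the kernel chose.
documented_max = 450000   # upper limit shown by mtune
kernel_value = 467116     # "NBUF reduced to 467116" in the boot message

increase_pct = (kernel_value - documented_max) / documented_max * 100
print(f"{increase_pct:.1f}% more buffers")
```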

> ps:
> (2) If NBUF has a value other than zero, Is it ok to let NHBUF=0? Can
> NHBUF self-tune according to NBUF when NBUF is not set to zero?


It should auto-tune. You can observe runtime values of these by doing:

# crash
> v | grep buf

v_buf: 450000
v_hbuf: 524288

If you boot with different forced NBUF (v_buf) values, you should see
v_hbuf (NHBUF) float to different values. It's always a power of 2 so
you'll have to make sharp changes to NBUF to see NHBUF change.
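
The pair of values shown above (450000 and 524288) is consistent with NHBUF being rounded up to the next power of two at or above NBUF. That rule is only a guess from this single data point, not a documented kernel formula, and the helper name below is made up for illustration.

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()

# Observed pair from the crash output above.
assert next_power_of_two(450000) == 524288

# Hypothetical: what this guess would predict for a much smaller NBUF.
print(next_power_of_two(50000))
```

If the guess holds, NBUF has to cross a power-of-two boundary before v_hbuf moves, which is why small adjustments would show no change.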

> (3)When does MAXBUF have effect, when NBUF is zero or NBUF is not zero?


MAXBUF is an obsolete parameter, no longer edited by configure(ADM), no
longer meaningful to the kernel.

>Bela<

  #3 (permalink)  
Old 04-03-2008, 03:45 PM
yannanqi@126.com
 

Bela, I can't put my gratitude into words. You're great, many
thanks! Your clear explanation set me free.

But the SCO system really does have a performance problem: the telnet
users' working interface is very slow, and the items in drop-down list
fields appear slowly, one by one.

(1)sar -b: The %rcache and %wcache seem to be normal.
SCO_SV zjyw-38 3.2v5.0.6 i80386 04/03/2008

14:25:10 bread/s lread/s %rcache bwrit/s lwrit/s %wcache pread/s pwrit/s (-b)
14:25:11 18 36849 100 388 5282 93 0 0
14:25:12 3 158264 100 33 316 89 0 0
14:25:13 0 59686 100 0 226 100 0 0
14:25:14 0 79142 100 0 159 100 0 0
14:25:15 0 164502 100 0 50 100 0 0
14:25:16 0 169043 100 0 181 100 0 0
14:25:17 0 61087 100 0 7 100 0 0
14:25:18 0 15037 100 0 16 100 0 0
14:25:19 0 230439 100 0 102 100 0 0
14:25:20 0 55642 100 0 61 100 0 0
14:25:21 0 37027 100 0 12 100 0 0
14:25:22 1 127536 100 0 43 100 0 0
14:25:23 0 133101 100 18 122 85 0 0
14:25:24 0 17444 100 1 2 60 0 0
14:25:25 0 0 0 0 0 100 0 0
14:25:26 0 7142 100 1 12 92 0 0
14:25:27 0 3 100 4 4 11 0 0
14:25:28 0 146721 100 0 79 100 0 0
14:25:29 0 8179 100 0 37 100 0 0
14:25:30 0 175348 100 0 37 100 0 0
14:25:31 0 98968 100 0 81 100 0 0
14:25:32 0 67449 100 0 26 100 0 0
14:25:33 0 66537 100 0 27 100 0 0
14:25:34 0 19567 100 0 8 100 0 0
14:25:35 0 99711 100 0 31 100 0 0
14:25:36 0 45507 100 0 86 100 0 0
14:25:37 0 98409 100 0 34 100 0 0
14:25:39 0 85748 100 10 80 88 0 0
14:25:40 130 156812 100 6 5129 100 0 0
14:25:41 0 14653 100 421 143 0 0 0
14:25:42 0 431218 100 0 284 100 0 0
14:25:43 0 26278 100 0 81 100 0 0
14:25:44 0 77340 100 0 116 100 0 0
14:25:45 0 18695 100 0 18 100 0 0
14:25:46 0 21389 100 0 20 100 0 0
14:25:47 0 149728 100 11 68 84 0 0
14:25:48 0 1027 100 0 56 100 0 0

(2) sar -d:
14:24:48 Sdsk-0 4.95 1.00 7.92 15.84 0.00 6.25
14:24:49
14:24:50
14:24:51
14:24:52
14:24:53
14:24:54 Sdsk-0 1.98 1.00 4.95 9.90 0.00 4.00
14:24:55
14:24:56
14:24:57 Sdsk-0 10.00 1.00 13.00 26.00 0.00 7.69
14:24:58 Sdsk-0 100.00 1.00 53.92 178.43 0.00 21.82
14:24:59 Sdsk-0 100.00 1.00 240.59 1976.24 0.00 44.07
14:25:00 Sdsk-0 0.99 1.00 0.99 1.98 0.00 10.00
14:25:01
14:25:02
14:25:03
14:25:04
14:25:05
14:25:06
14:25:07
14:25:08
14:25:09 Sdsk-0 0.99 1.00 0.99 9.90 0.00 10.00
14:25:10
14:25:11 Sdsk-0 100.00 1.00 134.65 857.43 0.00 76.25
14:25:12 Sdsk-0 21.78 1.00 6.93 27.72 0.00 31.43
14:25:13
14:25:14
14:25:15
14:25:16
14:25:17
14:25:18
14:25:19
14:25:20
14:25:21 Sdsk-0 1.00 1.00 1.00 2.00 0.00 10.00
14:25:22
14:25:23 Sdsk-0 17.65 1.00 18.63 37.25 0.00 9.47

(3)vmstat:
Thu Mar 27 16:23:31 CST 2008
# vmstat 1 100

PROCS PAGING SYSTEM CPU
r b w frs dmd sw cch fil pft frp pos pif pis rso rsi sy cs us su id

2 921 0 1048576 8 0 0 0 2 0 0 0 0 0 0 1128585 54570 28 27 45
2 921 0 1048576 62 0 0 0 0 0 0 0 0 0 0 1174108 57445 20 32 48
2 918 0 1048576 5 0 0 0 14 0 0 0 0 0 0 1149028 52759 35 36 29
2 920 0 1048576 26 0 425 0 104 0 0 0 0 0 0 1180505 55345 24 33 43
2 920 0 1048576 0 0 0 0 0 0 0 0 0 0 0 1204870 57131 29 26 45
5 917 0 1048576 28 0 0 0 1 0 0 0 0 0 0 1154900 53069 30 30 40
2 922 0 1048576 31 0 425 0 104 0 0 0 0 0 0 1232803 55362 27 35 38
3 921 0 1048576 30 0 0 0 0 0 0 0 0 0 0 1153011 47926 27 37 36
3 925 0 1048576 111 0 822 0 165 0 0 0 0 0 0 1165042 55808 34 39 27
2 926 0 1048576 24 0 85 0 18 0 0 0 0 0 0 1203487 46580 32 43 25
4 925 0 1048576 6 0 85 0 15 0 0 0 0 0 0 1215516 55999 23 35 42

*************************
My thoughts:
(1) In vmstat's output, "sy" (system calls) and "cs" (context switches)
are very high. Are these values normal? (The SCO system is only a
telnet endpoint server, and the telnet users don't do much work,
only querying and charging; the database server is on a third-party
system.)
(2) In vmstat's output, does "cch" affect the system's performance?
  #4 (permalink)  
Old 04-07-2008, 09:34 AM
Bill Campbell
 

On Wed, Apr 02, 2008, yannanqi@126.com wrote:
>Bela,I can't express my heart by words.Only one word:you're
>great,great thanks! You freeed me through clear explanation.
>
>But the sco system really encounters performance problem: the telnet
>users' working interface is very slow,the items of the dropdown list
>fields slowly appears one by one.


Does the system exhibit this type of performance on the console?
If it doesn't, the problem is most likely network related.

If it is a network problem, it could be a bad NIC, network
switch, hub, or even another machine on the LAN with the same IP
address as the server. DNS problems usually show up with long
initial connection times as the system attempts to resolve the
host name of the connecting IP.

I have seen major problems with NICs which show high numbers of
errors on incoming and outgoing packets. On Linux systems the
ifconfig command shows the error history, but SCO's doesn't, at
least not on the OSR 5.0.6a systems we have here.

Bill
--
INTERNET: bill@celestial.com Bill Campbell; Celestial Software LLC
URL: http://www.celestial.com/ PO Box 820; 6641 E. Mercer Way
FAX: (206) 232-9186 Mercer Island, WA 98040-0820; (206) 236-1676

That rifle on the wall of the labourer's cottage or working class flat is
the symbol of democracy. It is our job to see that it stays there.
--GEORGE ORWELL
  #5 (permalink)  
Old 04-07-2008, 09:34 AM
Bela Lubkin
 

yannanqi@126.com wrote:

> But the sco system really encounters performance problem: the telnet
> users' working interface is very slow,the items of the dropdown list
> fields slowly appears one by one.
>
> (1)sar -b: The %rcache and %wcache seem to be normal.


They're actually exceptionally high (since you have so much buffer
cache). Shouldn't be causing a performance problem.
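
As a rough check on how these percentages relate to the raw counters, the sketch below applies the usual sar definition (hit ratio = 1 - physical/logical) to the averages from the `sar -A` output in the first post. The function name is illustrative, and the handling of the zero-traffic case is an arbitrary choice here, not necessarily what sar does.

```python
def cache_hit_pct(physical: float, logical: float) -> float:
    """Percentage of logical requests satisfied from the buffer cache."""
    if logical == 0:
        return 0.0  # arbitrary choice for an idle interval
    # Clamp at zero: physical writes can exceed logical writes when
    # previously cached dirty blocks are flushed in the same interval.
    return max(0.0, 100.0 * (1.0 - physical / logical))

# Averages from the sar -A output in the first post.
rcache = cache_hit_pct(physical=6, logical=193153)   # bread/s, lread/s
wcache = cache_hit_pct(physical=68, logical=1071)    # bwrit/s, lwrit/s
print(round(rcache), round(wcache))
```

Both results round to the %rcache and %wcache averages that sar itself reported.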

> (2) sar -d:
> 14:24:48 Sdsk-0 4.95 1.00 7.92 15.84 0.00 6.25
> 14:24:49
> 14:24:50
> 14:24:51
> 14:24:52
> 14:24:53
> 14:24:54 Sdsk-0 1.98 1.00 4.95 9.90 0.00 4.00
> 14:24:55
> 14:24:56
> 14:24:57 Sdsk-0 10.00 1.00 13.00 26.00 0.00 7.69
> 14:24:58 Sdsk-0 100.00 1.00 53.92 178.43 0.00 21.82
> 14:24:59 Sdsk-0 100.00 1.00 240.59 1976.24 0.00 44.07
> 14:25:00 Sdsk-0 0.99 1.00 0.99 1.98 0.00 10.00
> 14:25:01
> 14:25:02


Alternating 100% busy and idle, hmmm.
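
To see how bursty the disk is, the full `sar -d` sample posted earlier in the thread (14:24:48 through 14:25:23) can be tallied by second. This is only a sketch over that one sample; seconds with no Sdsk-0 line are treated as idle, which is an assumption about how sar prints inactive intervals.

```python
# %busy per one-second sample from the sar -d run (14:24:48 to 14:25:23).
# Seconds with no Sdsk-0 line are treated as idle.
busy_by_second = {
    "14:24:48": 4.95, "14:24:54": 1.98, "14:24:57": 10.00,
    "14:24:58": 100.00, "14:24:59": 100.00, "14:25:00": 0.99,
    "14:25:09": 0.99, "14:25:11": 100.00, "14:25:12": 21.78,
    "14:25:21": 1.00, "14:25:23": 17.65,
}
total_seconds = 36  # 14:24:48 through 14:25:23 inclusive

saturated = [t for t, busy in busy_by_second.items() if busy >= 100.0]
idle_seconds = total_seconds - len(busy_by_second)

print(f"saturated seconds: {len(saturated)} of {total_seconds}")
print(f"seconds with no disk activity: {idle_seconds}")
```

In this sample the disk is pegged for only a few isolated seconds and silent for most of the rest, so the 100% average in the first post looks more like short bursts than sustained load.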

How full are the filesystems? HTFS on OSR506 (and OSR507 without at
least MP3 or so) was extremely inefficient at allocating space on
nearly-full filesystems. On large filesystems (100GiB would be large
enough), this inefficiency was costly in both CPU and disk I/O terms.

But your buffer cache stats suggest this is not the problem.

Much more likely: you've got dirty buffer cache storms. You've given
the system 450MB of buffer cache. A process that was writing very
quickly to an already allocated file could dirty tens of megabytes in a
few seconds. Those blocks would stay in cache until bdflush was run,
then they would all try to write to disk at the same time, busying out
the disk for a long time.

To mitigate this, change BDFLUSHR to 1 (run bdflush as often as
possible, once a second) and NAUTOUP to 2 (flush buffers as soon as
they are 2 seconds old). This costs a bit of extra CPU, but your
system has plenty to spare.
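
A very simplified model of why these two tunables matter: if a process dirties buffers at a steady rate, each bdflush run has to write out roughly BDFLUSHR seconds' worth of data at once, and a dirty buffer can wait up to NAUTOUP + BDFLUSHR seconds. The write rate below is hypothetical, the "current" values (BDFLUSHR=30, NAUTOUP=10) are the ones reported later in this thread, and the model ignores everything else the kernel does.

```python
def flush_burst_mb(write_rate_mb_s: float, bdflushr_s: int) -> float:
    """Dirty data that piles up between two consecutive bdflush runs."""
    return write_rate_mb_s * bdflushr_s

def max_dirty_age_s(bdflushr_s: int, nautoup_s: int) -> int:
    """Longest time a dirty buffer can wait before it is written."""
    return nautoup_s + bdflushr_s

WRITE_RATE_MB_S = 5.0  # hypothetical sustained write rate, not measured

for label, bdflushr, nautoup in [("current", 30, 10), ("proposed", 1, 2)]:
    burst = flush_burst_mb(WRITE_RATE_MB_S, bdflushr)
    age = max_dirty_age_s(bdflushr, nautoup)
    print(f"{label}: up to {burst:.0f} MB per flush, max age {age} s")
```

The point of the comparison is the shape, not the numbers: frequent flushing trades one large write storm for many small ones.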

I seem to remember that one of the OSR507 patches also improved some
buffer cache handling. With your 506, the system might actually run
_faster_ with a much smaller buffer cache. You should test it with a
sharp reduction, e.g. NBUF=50000; revert back to 450000 if it doesn't
help.

Because of the buffer cache & filesystem space allocation improvements,
this system would probably be a lot happier under OSR507 + MP5. (Or it
might make no difference... can't really tell without trying.)

> (3)vmstat:
> Thu Mar 27 16:23:31 CST 2008
> # vmstat 1 100
>
> PROCS PAGING SYSTEM CPU
> r b w frs dmd sw cch fil pft frp pos pif pis rso rsi sy cs us su id
>
> 2 921 0 1048576 8 0 0 0 2 0 0 0 0 0 0 1128585 54570 28 27 45
> 2 921 0 1048576 62 0 0 0 0 0 0 0 0 0 0 1174108 57445 20 32 48
> 2 918 0 1048576 5 0 0 0 14 0 0 0 0 0 0 1149028 52759 35 36 29
> 2 920 0 1048576 26 0 425 0 104 0 0 0 0 0 0 1180505 55345 24 33 43
> 2 920 0 1048576 0 0 0 0 0 0 0 0 0 0 0 1204870 57131 29 26 45
> 5 917 0 1048576 28 0 0 0 1 0 0 0 0 0 0 1154900 53069 30 30 40
> 2 922 0 1048576 31 0 425 0 104 0 0 0 0 0 0 1232803 55362 27 35 38
> 3 921 0 1048576 30 0 0 0 0 0 0 0 0 0 0 1153011 47926 27 37 36
> 3 925 0 1048576 111 0 822 0 165 0 0 0 0 0 0 1165042 55808 34 39 27
> 2 926 0 1048576 24 0 85 0 18 0 0 0 0 0 0 1203487 46580 32 43 25
> 4 925 0 1048576 6 0 85 0 15 0 0 0 0 0 0 1215516 55999 23 35 42
>
> *************************
> My thoughts:
> (1)By vmstat's output,the "sy:system calls" and "cs:context switch"
> are very high. Are these value normal? (The sco system is only an
> endpoint-telnet-server, and the telnet users don't have much business,
> only querying & charging--the database server is on third part)
> (2)By vmstat's output,does the "cch" effect the system's performance?


Syscalls/sec does seem high, but since the CPU is still 40% idle, that's
not the problem. Context switches/sec is in line with syscalls.
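
For reference, the idle figure and the syscall-to-context-switch ratio can be computed from the eleven quoted vmstat rows. This is just arithmetic over the sample above; only the sy, cs, us, su and id columns are used.

```python
# (sy, cs, us, su, id) columns from the quoted vmstat 1 100 sample.
rows = [
    (1128585, 54570, 28, 27, 45),
    (1174108, 57445, 20, 32, 48),
    (1149028, 52759, 35, 36, 29),
    (1180505, 55345, 24, 33, 43),
    (1204870, 57131, 29, 26, 45),
    (1154900, 53069, 30, 30, 40),
    (1232803, 55362, 27, 35, 38),
    (1153011, 47926, 27, 37, 36),
    (1165042, 55808, 34, 39, 27),
    (1203487, 46580, 32, 43, 25),
    (1215516, 55999, 23, 35, 42),
]

avg_idle = sum(r[4] for r in rows) / len(rows)
total_sy = sum(r[0] for r in rows)
total_cs = sum(r[1] for r in rows)
syscalls_per_switch = total_sy / total_cs

print(f"average idle: {avg_idle:.0f}%")
print(f"syscalls per context switch: {syscalls_per_switch:.1f}")
```

The average idle comes out just under the "40%" quoted above, and the ratio stays in a narrow band across rows, which is one way to read "in line with syscalls".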

I think cch "pages in cache" refers to pages that have been marked for
possible purging by the virtual memory sweeper, then were demonstrated
(by a page fault) to still be in use. This is part of the normal
functioning of the virtual memory system and the page rate looks
reasonable, maybe even a bit low (not a worry).

>Bela<

  #6 (permalink)  
Old 04-07-2008, 09:34 AM
Bela Lubkin
 

Bill Campbell wrote:

> I have seen major problems with NICs which show high numbers of
> errors on incoming and outgoing packets. On Linux systems the
> ifconfig command shows the error history, but SCO's doesn't, at
> least not on the OSR 5.0.6a systems we have here.


netstat -i; ndstat -l

>Bela<

  #7 (permalink)  
Old 04-07-2008, 09:35 AM
yannanqi@126.com
 

Sorry for my absence these past few days; I was on a national holiday.
Your rich knowledge and patience overwhelm me.... Thanks again.

(1)Bela Lubkin wrote:
> Bill Campbell wrote:
>
>> Does the system exhibit this type of performance on the console?
>> If it doesn't, the problem is most likely network related.


> > I have seen major problems with NICs which show high numbers of
> > errors on incoming and outgoing packets. On Linux systems the
> > ifconfig command shows the error history, but SCO's doesn't, at
> > least not on the OSR 5.0.6a systems we have here.

>
> netstat -i; ndstat -l
>
> >Bela<


Because the application requires an operator ID and password to log in,
I can't test it on the console right now. I'll try to do so later and
report the result to you. The NIC should be OK, because the old
application works fine. The following is the output of netstat and
ndstat:

# netstat -i
Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll
net1 1500 142.70 zjyw-38 8460437 0 7113058 0 469448
lo0 8232 loopback localhost 2467255 0 2467255 0 0
atl0* 8232 none none No Statistics Available

# ndstat
Device MAC address in use Factory MAC Address
------ ------------------ -------------------
/dev/net1 00:15:60:a5:ac:80 00:15:60:a5:ac:80

Multicast address table
-----------------------
01:00:5e:00:00:01

FRAMES
Unicast Multicast Broadcast Error Octets Queue Length
---------- --------- --------- ------ ----------- ------------
In: 7943453 0 517623 0 613619190 0
Out: 7113607 0 1 0 500415540 0

# ndstat -l
Device MAC address in use Factory MAC Address
------ ------------------ -------------------
/dev/net1 00:15:60:a5:ac:80 00:15:60:a5:ac:80

Multicast address table
-----------------------
01:00:5e:00:00:01

FRAMES
Unicast Multicast Broadcast Error Octets Queue Length
---------- --------- --------- ------ ----------- ------------
In: 7943990 0 517715 0 613689653 0
Out: 7114102 0 1 0 500466929 0

DLPI Module Info: 2 SAPs open, 18 SAPs maximum
5281 frames received destined for an unbound SAP

MAC Driver Info: Media_type: Ethernet
Min_SDU: 14, Max_SDU: 1514, Address length: 6
Interface speed: 10 Mbits/sec

DLPI Restarts Info: Last queue size: 0
Last send time: 6080505
Restart in progress: 0
Number of restarts: 0

Interface Version: MDI 100

ETHERNET SPECIFIC STATISTICS

Collision Table - The number of frames successfully transmitted,
but involved in at least one collision:

Frames Frames
------- -------
1 collision 229269 9 collisions 125
2 collisions 55519 10 collisions 19
3 collisions 15181 11 collisions 2
4 collisions 9432 12 collisions 0
5 collisions 6231 13 collisions 0
6 collisions 1748 14 collisions 0
7 collisions 248 15 collisions 0
8 collisions 151 16 collisions 0


Bad Alignment             0        Number of frames received that were
                                   not an integral number of octets

FCS Errors                0        Number of frames received that did
                                   not pass the Frame Check Sequence

SQE Test Errors           0        Number of Signal Quality Error Test
                                   signals that were detected by the
                                   adapter

Deferred Transmissions    118929   Number of frames delayed on the
                                   first transmission attempt because
                                   the media was busy

Late Collisions           0        Number of times a collision was
                                   detected later than 512 bits into
                                   the transmitted frame

Excessive Collisions      0        Number of frames dropped on transmission
                                   because of excessive collisions

Internal MAC Transmit     0        Number of frames dropped on transmission
Errors                             because of errors not covered above

Carrier Sense Errors      0        Number of times that the carrier sense
                                   condition was lost when attempting to
                                   send a frame that was deferred for an
                                   excessive amount of time

Frame Too Long            0        Number of frames dropped on reception
                                   because they were larger than the
                                   maximum Ethernet frame size

Internal MAC Receive      0        Number of frames dropped on reception
Errors                             because of errors not covered above

Spurious Interrupts       0        Number of times the adapter interrupted
                                   the system for an unknown reason

No STREAMS Buffers        0        Number of frames dropped on reception
                                   because no STREAMS buffers were
                                   available

Underruns/Overruns        0        Number of times the transfer of
                                   data to or from the frame buffer
                                   did not complete successfully

Device Timeouts           0        Number of times the adapter failed to
                                   respond to a request from the driver
#

(2) The filesystems are mostly free, so this shouldn't be the problem:
# dfspace
/ : Disk space: 7434.21 MB of 8927.00 MB available (83.28%).
/stand : Disk space: 2.41 MB of 14.99 MB available (16.12%).
/serv : Disk space: 27516.99 MB of 29998.61 MB available (91.73%).
/servbak : Disk space: 28590.46 MB of 29999.01 MB available (95.30%).

(3)
> Much more likely: you've got dirty buffer cache storms. You've given
> the system 450MB of buffer cache. A process that was writing very
> quickly to an already allocated file could dirty tens of megabytes in a
> few seconds. Those blocks would stay in cache until bdflush was run,
> then they would all try to write to disk at the same time, busying out
> the disk for a long time.
>
> To mitigate this, change BDFLUSHR to 1 (run bdflush as often as
> possible, once a second) and NAUTOUP to 2 (flush buffers that are no
> more than 2 seconds old). This costs a bit of extra CPU, but your
> system has plenty to spare.


This shouldn't be the cause, because the application generates few
writes, about 20 MB per day in total.

(4)
> I seem to remember that one of the OSR507 patches also improved some
> buffer cache handling. With your 506, the system might actually run
> _faster_ with a much smaller buffer cache. You should test it with a
> sharp reduction, e.g. NBUF=50000; revert back to 450000 if it doesn't
> help.


I'll test it and report the result as soon as possible.

(5)
> Because of the buffer cache & filesystem space allocation improvements,
> this system would probably be a lot happier under OSR507 + MP5. (Or it
> might make no difference... can't really tell without trying.)


I have no way to do that... The software supplier says their
application only supports OSR506 and OSR505.
  #8 (permalink)  
Old 04-07-2008, 09:35 AM
Bela Lubkin
 

yannanqi@126.com wrote:

> The NIC should be ok, because the old application
> works fine. The following is the result of "netstat & ndstat":
>
> # netstat -i
> Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll
> net1 1500 142.70 zjyw-38 8460437 0 7113058 0 469448
> lo0 8232 loopback localhost 2467255 0 2467255 0 0
> atl0* 8232 none none No Statistics Available


> # ndstat
> MAC Driver Info: Media_type: Ethernet
> Min_SDU: 14, Max_SDU: 1514, Address length: 6
> Interface speed: 10 Mbits/sec


> Frames Frames
> ------- -------
> 1 collision 229269 9 collisions 125
> 2 collisions 55519 10 collisions 19
> 3 collisions 15181 11 collisions 2
> 4 collisions 9432 12 collisions 0
> 5 collisions 6231 13 collisions 0
> 6 collisions 1748 14 collisions 0
> 7 collisions 248 15 collisions 0
> 8 collisions 151 16 collisions 0


> Deferred Transmissions 118929 Number of frames delayed on the
> first transmission attempt because
> the media was busy


6.5% collisions on output seems pretty high.
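
The rate can be reproduced from the quoted `netstat -i` line, and cross-checked against the ndstat collision table. A short sketch, using only numbers quoted above (variable names are illustrative):

```python
# Figures from the netstat -i line for net1 and the ndstat -l collision table.
opkts = 7113058
coll = 469448

# frames_by_collisions[k] = frames sent successfully after exactly k collisions
frames_by_collisions = {
    1: 229269, 2: 55519, 3: 15181, 4: 9432, 5: 6231, 6: 1748,
    7: 248, 8: 151, 9: 125, 10: 19, 11: 2,
}

rate_pct = coll / opkts * 100
frames_hit = sum(frames_by_collisions.values())
events = sum(k * n for k, n in frames_by_collisions.items())

print(f"collisions per output packet: {rate_pct:.1f}%")
print(f"frames involved in a collision: {frames_hit}")
print(f"collision events implied by the table: {events}")
```

The ratio comes out at about 6.6%, in line with the rough figure above. The event count implied by the table is slightly higher than netstat's Coll value, presumably because ndstat was run a little later.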

For comparison, this system I'm looking at has sent 33 million packets,
experiencing 0 collisions and 133 deferred transmissions. Of course
it's the big fish on a pretty quiet LAN, and it's probably on a switch.

High collisions can be a sign of: very busy network; bad cables;
incorrect autodetection of duplex. You should put this system on a
100Mbps or 1Gbps network, preferably on a switch, and make sure it is
set for or autodetecting the right duplex setting.

You said the problem was telnet users having slow response. Interactive
use exchanges one or more packets for every character typed by the user.
With 6.5% collisions, every sentence they type is going to experience
several collisions and the resulting back-off algorithm. I can imagine
this causing the entire problem.
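
A back-of-envelope version of that estimate, assuming one outbound packet per typed character and an independent collision chance per packet equal to the Coll/Opkts ratio from `netstat -i`. Both assumptions are simplifications, and the sentence length is hypothetical.

```python
# Rough expectation, assuming one outbound packet per typed character
# and an independent collision chance per packet (both simplifications).
collision_rate = 469448 / 7113058   # Coll / Opkts from netstat -i
chars_in_sentence = 60              # hypothetical sentence length

expected_collisions = chars_in_sentence * collision_rate
p_at_least_one = 1 - (1 - collision_rate) ** chars_in_sentence

print(f"expected collisions: {expected_collisions:.1f}")
print(f"chance of at least one: {p_at_least_one:.0%}")
```

Under these assumptions a typical line of typing would hit a handful of collisions, each followed by a back-off delay.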

> (4)
> > I seem to remember that one of the OSR507 patches also improved some
> > buffer cache handling. With your 506, the system might actually run
> > _faster_ with a much smaller buffer cache. You should test it with a
> > sharp reduction, e.g. NBUF=50000; revert back to 450000 if it doesn't
> > help.

>
> I'll test it and feedback the result as soon as possible.


Ok. Even with the net issues, I'm still suspicious about the 100% busy
disk readings. Your buffer cache ratios are very high, disk shouldn't
need to be busy. Is it a very old & slow disk? Swap in a fast disk.

> (5)
> > Because of the buffer cache & filesystem space allocation improvements,
> > this system would probably be a lot happier under OSR507 + MP5. (Or it
> > might make no difference... can't really tell without trying.)

>
> I have no method... Because the software supplier says their
> application only support OSR506 and OSR505.


What they mean is they can't be bothered to test with anything newer.
It would probably be fine, backwards compatibility was/is SCO's core
competency...

>Bela<

  #9 (permalink)  
Old 04-07-2008, 09:35 AM
James_Szabadics
 

It could be network or disk or both! Having Bela in here is truly
awesome, I am not in Bela's league of understanding the inner workings
of SCO but I have some practical generic advice for you to consider.

The buffer cache flush daemon "bdflush" will be regularly flushing,
when it does it is writing your (huge) buffer cache to disk. This
could be responsible for the surges in disk i/o that you see.

Making your buffer cache smaller, flushing more frequently, or a
combination of both could help to smooth out the big data write
tsunamis into smaller waves, but looking at the underlying disk and/or
RAID architecture is also important. If your system is
experiencing a situation where the activity that is generating i/o is
very bursty and infrequent then a bigger cache could help deal with a
slow disk, but if the action is frequent or continuous then you really
need a faster disk. The frequency of i/o bursts and the timing of the
bdflush is also important, but faster disks always help.

You have a lot of collisions on your network - you need to deal with
that too. That could be a range of issues but work through a process
of elimination looking at things like the following........

a. Switch configuration (assuming you have managed switches with
some layer3 capabilities) - I have found IGMP snooping turned ON will
assist in managing broadcast traffic
b. check the event logs on the switch looking at error counts by
port and follow the trail to track down the source of the noise where
the counts are highest.
c. beef up the server to switch connection - make sure the data
pipes are fat where data converges! Upgrade the server NIC to gigabit
and plug it into a gigabit port on your switch. Make sure the backbone
of your network linking your LAN segments together has fat pipes
too....
d. look at implementing some QoS for your telnet traffic if all of
the above are fine


  #10 (permalink)  
Old 04-07-2008, 11:42 AM
yannanqi@126.com
 

Thanks to Bela and James for the warm-hearted and constructive advice.
I'll look into each point one by one.
About the disk's performance, ohh... The server is an HP DL380 G4 with
RAID 1, BDFLUSHR=30 and NAUTOUP=10. According to my inspection, the 100%
busy doesn't seem to be caused by bdflush, but I'm not sure. Maybe it
really is a hardware bottleneck...

No matter what, I'll take your advice to heart and try some changes,
then report the results to you. This may take some days because it is
a production environment.

Thanks again to Bela, James and Bill. Best regards to you all.