01-06-2008, 07:15 PM
John_B
 
Need assistance locating I/O bottleneck - lsof help, perhaps?

We have an SF6800 server w/ 20 CPUs and a boatload of A5x00s - over 150
disks. We've recently started experiencing *severe* I/O degradation.

After running guds and forwarding that information to Sun, two separate
engineers determined that our bootdisks in our D240 are the bottleneck
with system calls bottling up the I/O. A further examination (iostat
-xnp) of the boot drives shows that every 30 seconds, the disk is
getting slammed with approximately 200-300 I/O writes in a two-second
period with blocking averaging between 75-100%.
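
To make the pattern above concrete, here is a minimal sketch of how the
periodic burst could be picked out of per-second write counts. It assumes
the counts have already been extracted from the iostat output into a plain
list; the function names, the threshold of 50 writes/sec, and the sample
numbers are all hypothetical and not taken from the real system.

```python
def find_bursts(writes_per_sec, threshold=50):
    """Return (start_index, end_index, total_writes) for each run of
    consecutive one-second samples at or above the threshold."""
    bursts = []
    start = None
    for i, writes in enumerate(writes_per_sec):
        if writes >= threshold:
            if start is None:
                start = i  # a new burst begins here
        elif start is not None:
            # the burst ended on the previous sample
            bursts.append((start, i - 1, sum(writes_per_sec[start:i])))
            start = None
    if start is not None:
        # the data ended while a burst was still in progress
        last = len(writes_per_sec) - 1
        bursts.append((start, last, sum(writes_per_sec[start:])))
    return bursts


def burst_intervals(bursts):
    """Seconds between the start of each burst and the start of the next."""
    return [nxt[0] - cur[0] for cur, nxt in zip(bursts, bursts[1:])]


# Made-up sample: a mostly idle disk with two 2-second bursts, 30 s apart.
samples = [2] * 65
samples[5], samples[6] = 120, 130
samples[35], samples[36] = 120, 130

bursts = find_bursts(samples)
print(bursts)
print(burst_intervals(bursts))
```

A steady interval close to 30 seconds between bursts would point at
something running on a fixed timer rather than at ordinary user activity.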

I have several problems in isolating the cause for the I/O bottleneck,
however. One is that we have over 2,300 users with over 7,000 processes
running during a normal day. The second is that the boot disks are
encapsulated under VxVM. So, all of the activity is showing up under
slice 7, which is the public region, instead of the actual /opt, /var,
or /. So, there is no way to determine from iostat exactly where the
hundreds of writes are coming from.

Sun recommended lsof, but since it's an open-source utility, they don't
support it. lsof obviously has a boatload of options to try to get the
appropriate data. Running lsof by itself is useless because of the huge
amount of I/O that we get on a normal day.

I believe that I have Adrian Cockcroft's Solaris Tuning book at home, but
it will still take time to read through and try to figure out what might
be happening. Because this is our production system, I obviously can't
do anything on the fly that might require a reboot.

Has anyone run into these kinds of problems? Any ideas on what to look
for? Any suggestions on the recommended syntax for lsof? Are there
better tools out there to try to find this bottleneck?