Disconnects hanging server (Pgsql General forum, PostgreSQL category)
We have a dual 3.0 GHz Intel Dual-core Xserve, running Mac OS X 10.5.1 Leopard Server and PostgreSQL 8.2.5. When we disconnect several clients at a time (30+) in production, the CPU goes through the roof and the server hangs for many seconds, during which it is completely non-responsive. It seems the busier the server is, the longer the machine will hang.

With an identical postgresql.conf file in the identical production environment, our Linux 2.6.22 box running PG 8.2.5 has no problems when disconnecting multiple clients. Also, our prior G5 Xserve running Mac OS X Server 10.4.9 and PG 8.2.4 had no issues disconnecting multiple clients.

Using pgbench, I have been able to duplicate the issue on another Intel Xserve running 10.5.1 on a fresh install of PG 8.2.5. PG was compiled 64-bit using CFLAGS='-arch x86_64'. The only config option was --enable-thread-safety. The only modifications I have made to the postgresql.conf file are as follows:

max_connections = 175
shared_buffers = 3GB # The max supported under 10.5.1 (after setting shmall and shmmax accordingly)
checkpoint_segments = 64

I used a scale factor of 150 when initializing a database for pgbench. If I run `pgbench -c 150 -t 5000` and kill it (Ctrl-C) shortly after launching it, but after it completes its vacuum, there is a very minor and brief increase in CPU usage (which I didn't notice at all, btw, on the Linux box). If I let pgbench run for approximately 10 minutes and then Ctrl-C it, the CPU will max out and the machine will hang. iostat stops reporting and top stops refreshing. This lasts for a couple seconds, then top and iostat resume.

Here is what iostat showed when I killed pgbench after approximately 10 minutes:

postgres$ iostat -n 5 1
....
      disk0              disk1              disk2              disk3          cpu        load average
  KB/t  tps  MB/s    KB/t  tps  MB/s    KB/t  tps  MB/s    KB/t  tps  MB/s   us  sy  id     1m    5m   15m
  0.00    0  0.00    0.00    0  0.00    0.00    0  0.00   10.20  732  7.30    2   4  93   1.07  2.22  2.30
  0.00    0  0.00    0.00    0  0.00    0.00    0  0.00    8.96  766  6.71    1   2  98   1.07  2.22  2.30
  0.00    0  0.00    0.00    0  0.00    0.00    0  0.00    9.33  755  6.88    1   2  97   1.07  2.22  2.30
  0.00    0  0.00    0.00    0  0.00    0.00    0  0.00    8.86  777  6.73    0   2  97   1.07  2.22  2.30
--> I hit ctrl-c to kill pgbench here
  0.00    0  0.00    0.00    0  0.00    0.00    0  0.00    9.17  766  6.86    1  43  55   1.07  2.22  2.30
  0.00    0  0.00    0.00    0  0.00    0.00    0  0.00    9.03  770  6.79    0  79  20   1.71  2.33  2.34
  0.00    0  0.00    0.00    0  0.00    0.00    0  0.00    9.04   77  0.68    1  38  61   1.71  2.33  2.34
  0.00    0  0.00    0.00    0  0.00    0.00    0  0.00    8.95  273  2.39    0  80  19   1.71  2.33  2.34
  0.00    0  0.00    0.00    0  0.00    0.00    0  0.00   15.03  240  3.53    1  99   1   1.71  2.33  2.34
  0.00    0  0.00    0.00    0  0.00    0.00    0  0.00    9.69  365  3.45    1  99   0   4.05  2.80  2.51
  0.00    0  0.00    0.00    0  0.00    0.00    0  0.00    0.00    0  0.00    0 100   0   4.05  2.80  2.51
  0.00    0  0.00    0.00    0  0.00    0.00    0  0.00    0.00    0  0.00    0 100   0   4.05  2.80  2.51
  0.00    0  0.00    0.00    0  0.00    0.00    0  0.00    8.50   16  0.13    0 100   0   8.85  3.82  2.87
  0.00    0  0.00    0.00    0  0.00    0.00    0  0.00   10.18   17  0.17    0 100   0   8.85  3.82  2.87
  0.00    0  0.00    0.00    0  0.00    0.00    0  0.00    9.00   75  0.66    0 100   0   8.85  3.82  2.87
  0.00    0  0.00    0.00    0  0.00    0.00    0  0.00    8.50   16  0.13    0 100   0  12.39  4.64  3.16
  0.00    0  0.00    0.00    0  0.00    0.00    0  0.00    9.10   68  0.60    0 100   0  14.20  5.14  3.35
  0.00    0  0.00    0.00    0  0.00    0.00    0  0.00    8.75   75  0.64    0 100   0  14.20  5.14  3.35
  0.00    0  0.00    0.00    0  0.00    0.00    0  0.00    8.83  249  2.14    0 100   0  14.20  5.14  3.35
  0.00    0  0.00    0.00    0  0.00    0.00    0  0.00    8.57   14  0.12    0 100   0  15.46  5.55  3.50
  0.00    0  0.00    0.00    0  0.00    0.00    0  0.00    9.33  265  2.41    1  99   0  15.46  5.55  3.50
  0.00    0  0.00    0.00    0  0.00    0.00    0  0.00    9.15  361  3.22    0 100   0  15.46  5.55  3.50
  0.00    0  0.00    0.00    0  0.00    0.00    0  0.00    8.33   40  0.32    1  99   0  15.46  5.55  3.50
  0.00    0  0.00    0.00    0  0.00    0.00    0  0.00    8.93  843  7.36    0 100   0  17.43  6.12  3.72
  0.00    0  0.00    0.00    0  0.00    0.00    0  0.00    8.84  560  4.84    0 100   0  17.43  6.12  3.72
  0.00    0  0.00    0.00    0  0.00    0.00    0  0.00    9.06  428  3.79    1  99   0  17.43  6.12  3.72
  0.00    0  0.00    0.00    0  0.00    0.00    0  0.00    8.00   12  0.10    0 100   0  17.43  6.12  3.72
  0.00    0  0.00    0.00    0  0.00    0.00    0  0.00    8.92  243  2.12    0  91   9  17.43  6.12  3.72
--> unit recovered here:
  0.00    0  0.00    0.00    0  0.00    0.00    0  0.00   13.11  628  8.03    0   2  97  16.03  6.02  3.69
  0.00    0  0.00    0.00    0  0.00    0.00    0  0.00    8.00  517  4.04    0   2  97  16.03  6.02  3.69
  0.00    0  0.00    0.00    0  0.00    0.00    0  0.00    8.88  511  4.43    0   2  97  16.03  6.02  3.69
  0.00    0  0.00    0.00    0  0.00    0.00    0  0.00    9.02  503  4.43    0   2  98  16.03  6.02  3.69

I installed PG 8.3 beta 3 to see if the behavior would be any different. The CPU usage in general seemed higher in PG 8.3 beta 3, and I still get the spike when disconnecting multiple clients.

I tried with default settings on 8.2.5 (except for a higher max_connections), as well as with only a higher shared_buffers, and also with only a higher checkpoint_segments. The CPU would still spike to 100% in all of these cases, but it didn't seem to stay there as long as when checkpoint_segments and shared_buffers are both high. I suppose the only difference may be the point at which I killed pgbench.

I'm not sure if this is a bug with PostgreSQL or OS X 10.5.1. Any suggestions on what I can do to narrow down the problem further would be greatly appreciated.

Brian Wipf
ClickSpace Interactive Inc.
<brian@clickspace.com>

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend
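One way to put a number on the stall in the iostat capture above is to scan the CPU columns and count consecutive samples where system time stays high. This is only a sketch under stated assumptions: the function names `parse_cpu` and `stall_windows` and the 30% `sy` threshold are illustrative choices, not anything from the post, and it assumes each data row ends with `us sy id` followed by the three load averages, as in the capture.

```python
def parse_cpu(line):
    """Return (us, sy, id) from one iostat data row, or None for other lines."""
    fields = line.split()
    # A data row has 3 values per disk, then us/sy/id, then 3 load averages.
    if len(fields) < 9 or len(fields) % 3 != 0:
        return None
    try:
        [float(f) for f in fields]  # rejects header and annotation lines
        us, sy, idle = (int(f) for f in fields[-6:-3])
    except ValueError:
        return None
    return us, sy, idle


def stall_windows(lines, sy_threshold=30):
    """Return the length of each run of consecutive samples with sy >= threshold."""
    runs, current = [], 0
    for line in lines:
        cpu = parse_cpu(line)
        if cpu is None:
            continue  # headers and "-->" notes do not break a run
        if cpu[1] >= sy_threshold:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


# Six consecutive rows copied from the capture, around the Ctrl-C.
excerpt = """\
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 9.33 755 6.88 1 2 97 1.07 2.22 2.30
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 8.86 777 6.73 0 2 97 1.07 2.22 2.30
--> I hit ctrl-c to kill pgbench here
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 9.17 766 6.86 1 43 55 1.07 2.22 2.30
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 9.03 770 6.79 0 79 20 1.71 2.33 2.34
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 9.04 77 0.68 1 38 61 1.71 2.33 2.34
0.00 0 0.00 0.00 0 0.00 0.00 0 0.00 8.95 273 2.39 0 80 19 1.71 2.33 2.34
"""

print(stall_windows(excerpt.splitlines()))
```

On this excerpt the run is four samples long. Fed the whole capture, the same threshold should report a single run of 24 samples between the Ctrl-C and the recovery note; how many seconds that represents depends on the sampling interval iostat was using.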

On Dec 3, 2007, at 4:16 PM, Brian Wipf wrote:

> We have a dual 3.0 GHz Intel Dual-core Xserve, running Mac OS X
> 10.5.1 Leopard Server and PostgreSQL 8.2.5. When we disconnect
> several clients at a time (30+) in production, the CPU goes through
> the roof and the server will hang for many seconds where it is
> completely non-responsive. It seems the busier the server is, the
> longer the machine will hang.

You should run Shark or Instruments to determine where the system is getting hung up. You will likely need to install developer tools. If you need help reading the profilers' output, please join up on an Apple list.

In my profiling of PostgreSQL 8.1 under 10.4, I found disappointing results, with bottlenecks in the mutex-locked stdio. I suspect that the results in 10.5 may be drastically different.

Cheers,
M

On 3-Dec-07, at 3:51 PM, A.M. wrote:

> On Dec 3, 2007, at 4:16 PM, Brian Wipf wrote:
>
>> We have a dual 3.0 GHz Intel Dual-core Xserve, running Mac OS X
>> 10.5.1 Leopard Server and PostgreSQL 8.2.5. When we disconnect
>> several clients at a time (30+) in production, the CPU goes through
>> the roof and the server will hang for many seconds where it is
>> completely non-responsive. It seems the busier the server is, the
>> longer the machine will hang.
>
> You should run Shark or Instruments to determine where the system is
> getting hung up. You will likely need to install developer tools. If
> you need help reading the profilers' output, please join up on an
> Apple list.

As per A.M.'s suggestion, I have run a time profile in Shark to get some idea of what's going on when the server hangs when disconnecting clients. Nearly 100% of the CPU is going into pmap_remove_range. The stack trace for pmap_remove_range, viewable within Shark, is:

-> pmap_remove_range
--> pmap_remove
---> vm_map_simplify
----> vm_map_remove
-----> task_terminate_internal
------> exit1
-------> exit
--------> unix_syscall64
---------> lo64_unix_scall

The call taking up the next highest amount of CPU, at 0.1%, is AtProcExit_Buffers. And its stack trace:

-> AtProcExit_Buffers
--> shmem_exit
---> proc_exit
----> PostgresMain
-----> BackendRun
------> BackendStartup
-------> ServerLoop
--------> PostmasterMain
---------> main
----------> start

Brian Wipf <brian@clickspace.com>
ClickSpace Interactive Inc.

Brian Wipf <brian@clickspace.com> writes:

> Nearly 100% of the CPU is going into pmap_remove_range. The stack
> trace for pmap_remove_range, viewable within Shark, is:
> -> pmap_remove_range
> --> pmap_remove
> ---> vm_map_simplify
> ----> vm_map_remove
> -----> task_terminate_internal
> ------> exit1
> -------> exit
> --------> unix_syscall64
> ---------> lo64_unix_scall

In case it's not obvious, this is a kernel performance bug, which you should report to Apple. In the meantime you might want to think about backing off your shared_buffers setting. I would suppose that the performance bug is being triggered by a very large shared memory segment (you said 3GB, right?).

regards, tom lane
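
To get a feel for why a very large shared memory segment could make backend exit expensive, and what backing off shared_buffers would involve, here is a rough sketch. It rests on assumptions the thread does not establish: that the work in pmap_remove_range grows with the number of pages each exiting backend has mapped, that pages are 4 kB, and that the relevant kernel limits are the Mac OS X sysctls kern.sysv.shmmax (bytes) and kern.sysv.shmall (4 kB pages). The 64 MB allowance for PostgreSQL's other shared structures is a guess, and the function names are hypothetical.

```python
PAGE_SIZE = 4096   # assumed page size in bytes
MB = 1024 ** 2
GB = 1024 ** 3


def pages_mapped(shared_buffers_bytes, backends):
    """Upper bound on page mappings to tear down if every backend
    has touched every shared buffer page before exiting."""
    return (shared_buffers_bytes // PAGE_SIZE) * backends


def sysv_limits(shared_buffers_bytes, overhead_bytes=64 * MB):
    """Return (shmmax, shmall) large enough for the segment.
    overhead_bytes is a guessed allowance, not a PostgreSQL figure."""
    segment = shared_buffers_bytes + overhead_bytes
    pages = -(-segment // PAGE_SIZE)   # ceiling division
    shmmax = pages * PAGE_SIZE         # bytes, kept a multiple of the page size
    shmall = pages                     # counted in pages
    return shmmax, shmall


# 150 matches the pgbench -c 150 client count used in the reproduction.
for size_gb in (3, 2, 1):
    size = size_gb * GB
    shmmax, shmall = sysv_limits(size)
    print(f"shared_buffers={size_gb}GB: "
          f"up to {pages_mapped(size, 150):,} mappings across 150 backends, "
          f"kern.sysv.shmmax={shmmax} kern.sysv.shmall={shmall}")
```

If the assumption about page mappings holds, it would also fit the observation that killing pgbench right after startup barely registers, since the backends would have touched few buffer pages by then. That is speculation, and only a profile on a smaller shared_buffers setting would confirm it.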