Unix Technical Forum

8.3beta1 testing on Solaris

This is a discussion on 8.3beta1 testing on Solaris within the Pgsql Performance forums, part of the PostgreSQL category; --> Update on my testing 8.3beta1 on Solaris. * CLOG reads * Asynchronous Commit benefit * Hot CPU Utilization Regards, ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql Performance

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-19-2008, 11:41 AM
Jignesh K. Shah
 
Posts: n/a
Default 8.3beta1 testing on Solaris

Update on my testing 8.3beta1 on Solaris.

* CLOG reads
* Asynchronous Commit benefit
* Hot CPU Utilization

Regards,
Jignesh

__Background_:_
We were using PostgreSQL 8.3beta1 testing on our latest Sun SPARC
Enterprise T5220 Server using Solaris 10 8/07 and Sun Fire X4200 using
Solaris 10 8/07. Generally for performance benefits in Solaris we put
file systems on forcedirectio we bypass the filesystem cache and go
direct to disks.

__Problem_:_
What we were observing that there were lots of reads happening about
4MB/sec on the file system holding $PGDATA and the database tables
during an OLTP Benchmark run. Initially we thought that our bufferpools
were not big enough. But thanks to 64-bit builds we could use bigger
bufferpools. However even with extraordinary bufferpool sizes we still
saw lots of reads going to the disks.

__DTrace to the Rescue_:_

I modified iosnoop.d to just snoop on reads. The modified rsnoop.d is as
follows:
$ cat rsnoop.d
#!/usr/sbin/dtrace -s
syscall::read:entry
/execname=="postgres"/
{
printf("pid %d reading %s\n", pid, fds[arg0].fi_pathname);
}

Based on it I found that most postgresql processes were doing lots of
reads from pg_clog directory.
CLOG or commit logs keep track of transactions in flight. Writes of CLOG
comes from recording of transaction commits( or when it aborts) or when
an XLOG is generated. However though I am not clear on reads yet, it
seems every process constantly reads it to get some status. CLOG data is
not cached in any PostgreSQL shared memory segments and hence becomes
the bottleneck as it has to constantly go to the filesystem to get the
read data.
# ./rsnoop.d
dtrace: script './rsnoop.d' matched 1 probe
CPU ID FUNCTION:NAME
0 49222 read:entry pid 8739 reading
/export/home0/igen/pgdata/pg_clog/000C

0 49222 read:entry pid 9607 reading
/export/home0/igen/pgdata/pg_clog/000C

0 49222 read:entry pid 9423 reading
/export/home0/igen/pgdata/pg_clog/000C

0 49222 read:entry pid 8731 reading
/export/home0/igen/pgdata/pg_clog/000C

0 49222 read:entry pid 8719 reading
/export/home0/igen/pgdata/pg_clog/000C

0 49222 read:entry pid 9019 reading
/export/home0/igen/pgdata/pg_clog/000C

1 49222 read:entry pid 9255 reading
/export/home0/igen/pgdata/pg_clog/000C

1 49222 read:entry pid 8867 reading
/export/home0/igen/pgdata/pg_clog/000C


Later on during another run I added ustack() after the printf in the
above script to get the function name also:

# ./rsnoop.d
dtrace: script './rsnoop.d' matched 1 probe
CPU ID FUNCTION:NAME
0 49222 read:entry pid 10956 reading
/export/home0/igen/pgdata/pg_clog/0011
libc.so.1`_read+0xa
postgres`SimpleLruReadPage+0x3e6
postgres`SimpleLruReadPage_ReadOnly+0x9b
postgres`TransactionIdGetStatus+0x1f
postgres`TransactionIdDidCommit+0x42
postgres`HeapTupleSatisfiesVacuum+0x21a
postgres`heap_prune_chain+0x14b
postgres`heap_page_prune_opt+0x1e6
postgres`index_getnext+0x144
postgres`IndexNext+0xe1
postgres`ExecScan+0x189
postgres`ExecIndexScan+0x43
postgres`ExecProcNode+0x183
postgres`ExecutePlan+0x9e
postgres`ExecutorRun+0xab
postgres`PortalRunSelect+0x47a
postgres`PortalRun+0x262
postgres`exec_execute_message+0x565
postgres`PostgresMain+0xf45
postgres`BackendRun+0x3f9

0 49222 read:entry pid 10414 reading
/export/home0/igen/pgdata/pg_clog/0011
libc.so.1`_read+0xa
postgres`SimpleLruReadPage+0x3e6
postgres`SimpleLruReadPage_ReadOnly+0x9b
postgres`TransactionIdGetStatus+0x1f
postgres`TransactionIdDidCommit+0x42
postgres`HeapTupleSatisfiesVacuum+0x21a
postgres`heap_prune_chain+0x14b
postgres`heap_page_prune_opt+0x1e6
postgres`index_getnext+0x144
postgres`IndexNext+0xe1
postgres`ExecScan+0x189
^C libc.so.1`_read+0xa
postgres`SimpleLruReadPage+0x3e6
postgres`SimpleLruReadPage_ReadOnly+0x9b
postgres`TransactionIdGetStatus+0x1f
postgres`TransactionIdDidCommit+0x42
postgres`HeapTupleSatisfiesMVCC+0x34f
postgres`index_getnext+0x29e
postgres`IndexNext+0xe1
postgres`ExecScan+0x189
postgres`ExecIndexScan+0x43
postgres`ExecProcNode+0x183
postgres`ExecutePlan+0x9e
postgres`ExecutorRun+0xab
postgres`PortalRunSelect+0x47a
postgres`PortalRun+0x262
postgres`exec_execute_message+0x565
postgres`PostgresMain+0xf45
postgres`BackendRun+0x3f9
postgres`BackendStartup+0x271
postgres`ServerLoop+0x259

0 49222 read:entry pid 10186 reading
/export/home0/igen/pgdata/pg_clog/0011
libc.so.1`_read+0xa
postgres`SimpleLruReadPage+0x3e6
postgres`SimpleLruReadPage_ReadOnly+0x9b
postgres`TransactionIdGetStatus+0x1f
postgres`TransactionIdDidCommit+0x42
postgres`HeapTupleSatisfiesVacuum+0x21a
postgres`heap_prune_chain+0x14b
postgres`heap_page_prune_opt+0x1e6
postgres`index_getnext+0x144
postgres`IndexNext+0xe1
postgres`ExecScan+0x189
postgres`ExecIndexScan+0x43
postgres`ExecProcNode+0x183
postgres`ExecutePlan+0x9e
postgres`ExecutorRun+0xab
postgres`PortalRunSelect+0x47a
postgres`PortalRun+0x262
postgres`exec_execute_message+0x565
postgres`PostgresMain+0xf45
postgres`BackendRun+0x3f9


So multiple processes are reading the same file. In this case since the
file system is told not to cache files, hence all read ios are being
sent to the disk to read the file again.



__Workaround for the high reads on CLOG on Solaris_ :
_Start with the cluster $PGDATA on regular UFS (which is buffered and
logging is enabled). Always create a new tablespace for your database on
forcedirectio mounted file system which bypasses the file system cache.
This allows all PostgreSQL CLOG files to be cached in UFS greatly
reducing stress on the underlying storage. For writes to the best of my
knowledge, PostgreSQL will still do fsync to force the writes the CLOGs
onto the disks so it is consistent. But the reads are spared from going
to the disks and returned from the cache.

__Result_:_
With rightly sized bufferpool now all database data can be in PostgreSQL
cache and hence reads are spared from the tablespaces. As for PGDATA
data, UFS will do the caching of CLOG files, etc and hence sparring
reads from going to the disks again. In the end what we achieve is a
right sized bufferpool where there are no reads required during a high
OLTP environment and the disks are just busy doing the writes of updates
and inserts.


_Asynchronous Commit_:

Also as requested by Josh, I tried out Asynchronous Commit in 8.3beta 1
I compared four scenarios on internal disks (the prime target)
1. Default options (commit_delay off, synchronous_commit=true)
2. With Commit_delay on
3. With Asynchronous and Commit_delay on
4. With Asynchronous commit but Commit_delay off
5. With Fsync off


In 8.2 I found compared to (1),( 2) gave me a huge boost (2X) but fsync
would be eventually even 2.8X faster than (2)

In 8.3 hence I did not even test the default option and took (2) as my
baseline run and found (3),(4),(5) pretty much gave me the similar boost
2.55X over my baseline run (2) since eventually I was CPU bound on my
box and IO ended up handling well.

(Though I found (5) was better in 8.2 compared to (5) in 8.3beta1 since
it was getting CPU saturated slightly earlier)


_Additional CPU consumption Findings_:
In the lightweight OLTP Testing that was performed with about 1000
users, with 8.3 with the above workaround in place for CLOG. I reached a
scenario where the system was out of CPU resources with about 1000
users. Anyway doing a quick profiling using the "hotuser" program
available in the DTraceToolkit the top function is postgres`GetSnapshotData

# ./hotuser -p 10559

.....
postgres`hash_seq_term 1 2.1%
postgres`SearchCatCache 2 4.2%
postgres`hash_search_with_hash_value 4 8.3%
postgres`GetSnapshotData 6 12.5%



Also Lock Waits during the 1000 User run was as follows:

# ./83_lwlock_wait.d 9231

Lock Id Mode Count
WALInsertLock Exclusive 1
ProcArrayLock Exclusive 19

Lock Id Combined Time (ns)
WALInsertLock 428507
ProcArrayLock 1009652776

# ./83_lwlock_wait.d 9153

Lock Id Mode Count
CLogControlLock Exclusive 1
WALInsertLock Exclusive 1
ProcArrayLock Exclusive 15

Lock Id Combined Time (ns)
CLogControlLock 25536
WALInsertLock 397265
ProcArrayLock 696894211

#


My Guess is that the ProcArrayLock is coming from the GetSnapShotData
function ... or maybe caused by it.. But I guess I will let the experts
comment ..

I am of the opionion that if we tune GetSnapShotData, then we should be
able to handle more users.


So overall I think I am excited with the 8.3beta1 performance specially
in terms of asynchronous_commit however to get the CPU performance in
line with GetSnapshotData() and also fixing the CLOG reading problem
occuring in SimpleLruRead() will greatly enhance the performance of 8.3
for OLTP benchmarks.




---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-19-2008, 11:41 AM
Tom Lane
 
Posts: n/a
Default Re: [HACKERS] 8.3beta1 testing on Solaris

"Jignesh K. Shah" <J.K.Shah@Sun.COM> writes:
> CLOG data is
> not cached in any PostgreSQL shared memory segments


The above statement is utterly false, so your trace seems to indicate
something broken. Are you sure these were the only reads of pg_clog
files? Can you extend the tracing to determine which page of the file
got read? I am wondering if your (unspecified) test load was managing
to touch more pages of the clog than there is room for in shared memory.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-19-2008, 11:41 AM
Gregory Stark
 
Posts: n/a
Default Re: 8.3beta1 testing on Solaris


"Jignesh K. Shah" <J.K.Shah@Sun.COM> writes:

> CLOG data is not cached in any PostgreSQL shared memory segments and hence
> becomes the bottleneck as it has to constantly go to the filesystem to get
> the read data.


This is the same bottleneck you discussed earlier. CLOG reads are cached in
the Postgres shared memory segment but only NUM_CLOG_BUFFERS are which
defaults to 8 buffers of 8kb each. With 1,000 clients and the transaction rate
you're running you needed a larger number of buffers.

Using the filesystem buffer cache is also an entirely reasonable solution
though. That's surely part of the logic behind not trying to keep more of the
clog in shared memory. Do you have any measurements of how much time is being
spent just doing the logical I/O to the buffer cache for the clog pages? 4MB/s
seems like it's not insignificant but your machine is big enough that perhaps
I'm thinking at the wrong scale.

I'm really curious whether you see any benefit from the vxid read-only
transactions. I'm not sure how to get an apples to apples comparison though.
Ideally just comparing it to CVS HEAD from immediately prior to the vxid patch
going in. Perhaps calling some function which forces an xid to be allocated
and seeing how much it slows down the benchmark would be a good substitute.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 04-19-2008, 11:41 AM
Gregory Stark
 
Posts: n/a
Default Re: [HACKERS] 8.3beta1 testing on Solaris

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> "Jignesh K. Shah" <J.K.Shah@Sun.COM> writes:
>> CLOG data is
>> not cached in any PostgreSQL shared memory segments

>
> The above statement is utterly false, so your trace seems to indicate
> something broken. Are you sure these were the only reads of pg_clog
> files? Can you extend the tracing to determine which page of the file
> got read? I am wondering if your (unspecified) test load was managing
> to touch more pages of the clog than there is room for in shared memory.


Didn't we already go through this? He and Simon were pushing to bump up
NUM_CLOG_BUFFERS and you were arguing that the test wasn't representative and
some other clog.c would have to be reengineered to scale well to larger
values.

Also it seemed there were only modest improvements from raising the value and
there would always be a ceiling to bump into so just raising the number of
buffers isn't particularly interesting unless there's some magic numbers we're
trying to hit.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 04-19-2008, 11:41 AM
Tom Lane
 
Posts: n/a
Default Re: [HACKERS] 8.3beta1 testing on Solaris

Gregory Stark <stark@enterprisedb.com> writes:
> Didn't we already go through this? He and Simon were pushing to bump up
> NUM_CLOG_BUFFERS and you were arguing that the test wasn't representative and
> some other clog.c would have to be reengineered to scale well to larger
> values.


AFAIR we never did get any clear explanation of what the test case is.
I guess it must be write-mostly, else lazy XID assignment would have
helped this by reducing the rate of XID consumption.

It's still true that I'm leery of a large increase in the number of
buffers without reengineering slru.c. That code was written on the
assumption that there were few enough buffers that a linear search
would be fine. I'd hold still for 16, or maybe even 32, but I dunno
how much impact that will have for such a test case.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #6 (permalink)  
Old 04-19-2008, 11:41 AM
Josh Berkus
 
Posts: n/a
Default Re: [HACKERS] 8.3beta1 testing on Solaris

Tom,

> It's still true that I'm leery of a large increase in the number of
> buffers without reengineering slru.c. That code was written on the
> assumption that there were few enough buffers that a linear search
> would be fine. I'd hold still for 16, or maybe even 32, but I dunno
> how much impact that will have for such a test case.


Actually, 32 made a significant difference as I recall ... do you still have
the figures for that, Jignesh?

The test case is a workload called "iGen" which is a "fixed" TPCC-like
workload. I've been trying to talk Sun into open-sourcing it, but no dice so
far. It is heavy on writes, and (like TPCC) consists mostly of one-line
transactions.

--
Josh Berkus
PostgreSQL @ Sun
San Francisco

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #7 (permalink)  
Old 04-19-2008, 11:41 AM
Tom Lane
 
Posts: n/a
Default Re: [HACKERS] 8.3beta1 testing on Solaris

Josh Berkus <josh@agliodbs.com> writes:
> Actually, 32 made a significant difference as I recall ... do you still have
> the figures for that, Jignesh?


I'd want to see a new set of test runs backing up any call for a change
in NUM_CLOG_BUFFERS --- we've changed enough stuff around this area that
benchmarks using code from a few months back shouldn't carry a lot of
weight.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #8 (permalink)  
Old 04-19-2008, 11:41 AM
Gregory Stark
 
Posts: n/a
Default Re: [HACKERS] 8.3beta1 testing on Solaris

"Josh Berkus" <josh@agliodbs.com> writes:

> Actually, 32 made a significant difference as I recall ... do you still have
> the figures for that, Jignesh?


Well it made a difference but it didn't remove the bottleneck, it just moved
it. IIRC under that benchmark Jignesh was able to run with x sessions
efficiently with 8 clog buffers, x + 100 or so sessions with 16 clog buffers
and x + 200 or so sessions with 32 clog buffers.

It happened that x + 200 was > the number of sessions he wanted to run the
benchmark at so it helped the benchmark results quite a bit. But that was just
an artifact of how many sessions the benchmark needed. A user who needs 1200
sessions or who has a different transaction load might find he needs more clog
buffers to alleviate the bottleneck. And of course most (all?) normal users
use far fewer sessions and won't run into this bottleneck at all.

Raising NUM_CLOG_BUFFERS just moves around the arbitrary bottleneck. This
benchmark is useful in that it gives us an idea where the bottleneck lies for
various values of NUM_CLOG_BUFFERS but it doesn't tell us what value realistic
users are likely to bump into.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #9 (permalink)  
Old 04-19-2008, 11:41 AM
Jignesh K. Shah
 
Posts: n/a
Default Re: 8.3beta1 testing on Solaris


Hi George,

I have seen the 4M/sec problem first actually during an EAStress type
run with only 150 connections.

I will try to do more testing today that Tom has requested.

Regards,
Jignesh


Gregory Stark wrote:
> "Jignesh K. Shah" <J.K.Shah@Sun.COM> writes:
>
>
>> CLOG data is not cached in any PostgreSQL shared memory segments and hence
>> becomes the bottleneck as it has to constantly go to the filesystem to get
>> the read data.
>>

>
> This is the same bottleneck you discussed earlier. CLOG reads are cached in
> the Postgres shared memory segment but only NUM_CLOG_BUFFERS are which
> defaults to 8 buffers of 8kb each. With 1,000 clients and the transaction rate
> you're running you needed a larger number of buffers.
>
> Using the filesystem buffer cache is also an entirely reasonable solution
> though. That's surely part of the logic behind not trying to keep more of the
> clog in shared memory. Do you have any measurements of how much time is being
> spent just doing the logical I/O to the buffer cache for the clog pages? 4MB/s
> seems like it's not insignificant but your machine is big enough that perhaps
> I'm thinking at the wrong scale.
>
> I'm really curious whether you see any benefit from the vxid read-only
> transactions. I'm not sure how to get an apples to apples comparison though.
> Ideally just comparing it to CVS HEAD from immediately prior to the vxid patch
> going in. Perhaps calling some function which forces an xid to be allocated
> and seeing how much it slows down the benchmark would be a good substitute.
>
>


---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #10 (permalink)  
Old 04-19-2008, 11:41 AM
Jignesh K. Shah
 
Posts: n/a
Default Re: [HACKERS] 8.3beta1 testing on Solaris

The problem I saw was first highlighted by EAStress runs with PostgreSQL
on Solaris with 120-150 users. I just replicated that via my smaller
internal benchmark that we use here to recreate that problem.

EAStress should be just fine to highlight it.. Just put pg_clog on
O_DIRECT or something so that all IOs go to disk making it easier to
observe.

In the meanwhile I will try to get more information.


Regards,
Jignesh


Tom Lane wrote:
> Gregory Stark <stark@enterprisedb.com> writes:
>
>> Didn't we already go through this? He and Simon were pushing to bump up
>> NUM_CLOG_BUFFERS and you were arguing that the test wasn't representative and
>> some other clog.c would have to be reengineered to scale well to larger
>> values.
>>

>
> AFAIR we never did get any clear explanation of what the test case is.
> I guess it must be write-mostly, else lazy XID assignment would have
> helped this by reducing the rate of XID consumption.
>
> It's still true that I'm leery of a large increase in the number of
> buffers without reengineering slru.c. That code was written on the
> assumption that there were few enough buffers that a linear search
> would be fine. I'd hold still for 16, or maybe even 32, but I dunno
> how much impact that will have for such a test case.
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
> choose an index scan if your joining column's datatypes do not
> match
>


---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 05:58 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Search Engine Optimization by vBSEO 3.2.0
www.UnixAdminTalk.com