Unix Technical Forum

Best COPY Performance

This is a discussion on Best COPY Performance within the Pgsql Performance forums, part of the PostgreSQL category; --> What is the best COPY performance that you have gotten on a "normal" table? I know that this is ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql Performance

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-19-2008, 09:39 AM
Worky Workerson
 
Posts: n/a
Default Best COPY Performance

What is the best COPY performance that you have gotten on a "normal" table?

I know that this is question is almost too general, but it might help
me out a bit, or at least give me the right things to tweak. Perhaps
the question can be rewritten as "Where are the major bottlenecks in a
COPY?" or "How can I compute the max theoretical COPY performance for
my hardware?". The two subquestions that I have from this are:
-Are my ETL scripts (perl) maximizing the database COPY speeds?
-Can I tweak my DB further to eek out a bit more performance?

I'm using perl to ETL a decent sized data set (10 million records) and
then loading it through perl:BI's copy. I am currently getting
between 10K and 15K inserts/second. I've profiled the ETL scripts a
bit and have performance-improved a lot of the code, but I'd like to
determine whether it makes sense to try and further optimize my Perl
or count it as "done" and look for improvements elsewhere.

I ran trivial little insert into a table with a single integer row and
came close to 250K inserts/second using psql's \copy, so I'm thinking
that my code could be optimized a bit more, but wanted to check around
to see if that was the case.

I am most interested in loading two tables, one with about 21 (small)
VARCHARs where each record is about 200 bytes, and another with 7
INTEGERs, 3 TIMESTAMPs, and 1 BYTEA where each record is about 350
bytes.

I have implemented most of the various bits of PG config advice that I
have seen, both here and with a some googling, such as:

wal_buffers=128
checkpoint_segments=128
checkpoint_timeout=3000

Software: PG 8.1.3 on RHEL 4.3 x86_64
Hardware: Quad Dual-core Opteron, Fibre Channel SAN with 256M BBC

Thanks!

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-19-2008, 09:39 AM
Merlin Moncure
 
Posts: n/a
Default Re: Best COPY Performance

On 10/21/06, Worky Workerson <worky.workerson@gmail.com> wrote:
> What is the best COPY performance that you have gotten on a "normal" table?
>
> I know that this is question is almost too general, but it might help
> me out a bit, or at least give me the right things to tweak. Perhaps
> the question can be rewritten as "Where are the major bottlenecks in a
> COPY?" or "How can I compute the max theoretical COPY performance for
> my hardware?". The two subquestions that I have from this are:
> -Are my ETL scripts (perl) maximizing the database COPY speeds?
> -Can I tweak my DB further to eek out a bit more performance?
>
> I'm using perl to ETL a decent sized data set (10 million records) and
> then loading it through perl:BI's copy. I am currently getting
> between 10K and 15K inserts/second. I've profiled the ETL scripts a
> bit and have performance-improved a lot of the code, but I'd like to
> determine whether it makes sense to try and further optimize my Perl
> or count it as "done" and look for improvements elsewhere.
>
> I ran trivial little insert into a table with a single integer row and
> came close to 250K inserts/second using psql's \copy, so I'm thinking
> that my code could be optimized a bit more, but wanted to check around
> to see if that was the case.
>
> I am most interested in loading two tables, one with about 21 (small)
> VARCHARs where each record is about 200 bytes, and another with 7
> INTEGERs, 3 TIMESTAMPs, and 1 BYTEA where each record is about 350
> bytes.


indexes/keys? more memory for sorting during index creation can have
a dramatic affect on bulk insert performance. check for pg_tmp
folders popping up during copy run.

> I have implemented most of the various bits of PG config advice that I
> have seen, both here and with a some googling, such as:
>
> wal_buffers=128
> checkpoint_segments=128
> checkpoint_timeout=3000
>
> Software: PG 8.1.3 on RHEL 4.3 x86_64
> Hardware: Quad Dual-core Opteron, Fibre Channel SAN with 256M BBC


for table light on indexes, 10-15k for copy is pretty poor. you can
get pretty close to that with raw inserts on good hardware. I would
suggest configuirng your perl script to read from stdin and write to
stdout, and pipe it to psql using copy from stdin. then just
benchmark your perl script redirecting output to a file.

merlin

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-19-2008, 09:40 AM
Markus Schaber
 
Posts: n/a
Default Re: Best COPY Performance

Hi, Worky,

Worky Workerson wrote:
> I am currently getting
> between 10K and 15K inserts/second.


> I ran trivial little insert into a table with a single integer row and
> came close to 250K inserts/second using psql's \copy, so I'm thinking
> that my code could be optimized a bit more, but wanted to check around
> to see if that was the case.


Could you COPY one of your tables out to disk via psql, and then COPY it
back into the database, to reproduce this measurement with your real data?

Also, how much is the disk load, and CPU usage?

As long as psql is factor 20 better than your perl script, I think that
the perl interface is what should be optimized.

On a table with no indices, triggers and contstraints, we managed to
COPY about 7-8 megabytes/second with psql over our 100 MBit network, so
here the network was the bottleneck.

You should think about making your perl program writing the COPY
statement as text, and piping it into psql.

HTH,
Markus
--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf. | Software Development GIS

Fight against software patents in Europe! www.ffii.org
www.nosoftwarepatents.org

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 04-19-2008, 09:40 AM
Luke Lonergan
 
Posts: n/a
Default Re: Best COPY Performance

Markus,

On 10/23/06 2:27 AM, "Markus Schaber" <schabi@logix-tt.com> wrote:

> On a table with no indices, triggers and contstraints, we managed to
> COPY about 7-8 megabytes/second with psql over our 100 MBit network, so
> here the network was the bottleneck.


We routinely get 10-12MB/s on I/O hardware that can sustain a sequential
write rate of 60+ MB/s with the WAL and data on the same disks.

It depends on a few things you might not consider, including the number and
type of columns in the table and the client and server encoding. The
fastest results are with more columns in a table and when the client and
server encoding are the same.

- Luke



---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 04-19-2008, 09:40 AM
Worky Workerson
 
Posts: n/a
Default Re: Best COPY Performance

> > I am most interested in loading two tables, one with about 21 (small)
> > VARCHARs where each record is about 200 bytes, and another with 7
> > INTEGERs, 3 TIMESTAMPs, and 1 BYTEA where each record is about 350
> > bytes.

>
> indexes/keys? more memory for sorting during index creation can have
> a dramatic affect on bulk insert performance. check for pg_tmp
> folders popping up during copy run.


The only index on load is a single IP4 btree primary key, which I
figure should function about like an INTEGER.

> for table light on indexes, 10-15k for copy is pretty poor. you can
> get pretty close to that with raw inserts on good hardware. I would
> suggest configuirng your perl script to read from stdin and write to
> stdout, and pipe it to psql using copy from stdin. then just
> benchmark your perl script redirecting output to a file.


So simple and hadn't thought of that ... thanks. When I pre-create a
COPY file, I can load it at about 45K inserts/sec (file was 1.8GB or
14.5 million records in 331 seconds), which looks like its about 5.5
MB/s. I'm loading from a local 15K SCSI320 RAID10 (which also
contains the PG log files) to a 10K SCSI320 RAID10 on an FC SAN. Does
this look more consistent with "decent" performance, or should I go
looking into some hardware issues i.e. SAN configuration? I've
currently got several hats including hardware/systems/security admin,
as well as DBA and programmer, and my SAN setup skills could
definitely use some more work.

Hardware aside, my perl can definitely use some work, and it seems to
be mostly the CSV stuff that I am using, mostly for convenience. I'll
see if I can't redo some of that to eliminate some CSV processing, or,
barring that, multithread the process to utilize more of the CPUs.
Part of the reason that I hadn't used psql in the first place is that
I'm loading the data into partitioned tables, and the loader keeps
several COPY connections open at a time to load the data into the
right table. I guess I could just as easily keep several psql pipes
open, but it seemed cleaner to go through DBI.

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #6 (permalink)  
Old 04-19-2008, 09:40 AM
Joshua D. Drake
 
Posts: n/a
Default Re: Best COPY Performance


> Ultimately, you might be best of using triggers instead of rules for the
> partitioning since then you could use copy. Or go to raw insert commands
> that are wrapped in a transaction.


My experience is that triggers are quite a bit faster than rules in any
kind of partitioning that involves more than say 7 tables.

Sincerely,

Joshua D. Drake


--

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate


---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #7 (permalink)  
Old 04-19-2008, 09:40 AM
Craig A. James
 
Posts: n/a
Default Re: Best COPY Performance

Jim C. Nasby wrote:
> http://stats.distributed.net used to use a perl script to do some
> transformations before loading data into the database. IIRC, when we
> switched to using C we saw 100x improvement in speed, so I suspect that
> if you want performance perl isn't the way to go. I think you can
> compile perl into C, so maybe that would help some.


I use Perl extensively, and have never seen a performance problem. I suspect the perl-to-C "100x improvement" was due to some other factor, like a slight change in the schema, indexes, or the fundamental way the client (C vs Perl) handled the data during the transformation, or just plain bad Perl code.

Modern scripting languages like Perl and Python make programmers far, far more productive than the bad old days of C/C++. Don't shoot yourself in the foot by reverting to low-level languages like C/C++ until you've exhausted all other possibilities. I only use C/C++ for intricate scientific algorithms.

In many cases, Perl is *faster* than C/C++ code that I write, because I can't take the time (for example) to write the high-performance string manipulation that have been fine-tuned and extensively optimized in Perl.

Craig


---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #8 (permalink)  
Old 04-19-2008, 09:41 AM
Worky Workerson
 
Posts: n/a
Default Re: Best COPY Performance

Markus,

> Could you COPY one of your tables out to disk via psql, and then COPY it
> back into the database, to reproduce this measurement with your real data?


$ psql -c "COPY my_table TO STDOUT" > my_data
$ ls my_data
2018792 edgescape_pg_load
$ time cat my_data | psql -c "COPY mytable FROM STDIN"
real 5m43.194s
user 0m35.412s
sys 0m9.567s

> Also, how much is the disk load, and CPU usage?


When I am loading via the perl (which I've established is a
bottleneck), the one CPU core is at 99% for the perl and another is at
30% for a postmaster, vs about 90% for the postmaster when going
through psql.

The disk load is where I start to get a little fuzzy, as I haven't
played with iostat to figure what is "normal". The local drives
contain PG_DATA as well as all the log files, but there is a
tablespace on the FibreChannel SAN that contains the destination
table. The disk usage pattern that I see is that there is a ton of
consistent activity on the local disk, with iostat reporting an
average of 30K Blk_wrtn/s, which I assume is the log files. Every
several seconds there is a massive burst of activity on the FC
partition, to the tune of 250K Blk_wrtn/s.

> On a table with no indices, triggers and contstraints, we managed to
> COPY about 7-8 megabytes/second with psql over our 100 MBit network, so
> here the network was the bottleneck.


hmm, this makes me think that either my PG config is really lacking,
or that the SAN is badly misconfigured, as I would expect it to
outperform a 100Mb network. As it is, with a straight pipe to psql
COPY, I'm only working with a little over 5.5 MB/s. Could this be due
to the primary key index updates?

Thanks!

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #9 (permalink)  
Old 04-19-2008, 09:41 AM
Jim C. Nasby
 
Posts: n/a
Default Re: Best COPY Performance

On Mon, Oct 23, 2006 at 03:37:47PM -0700, Craig A. James wrote:
> Jim C. Nasby wrote:
> >http://stats.distributed.net used to use a perl script to do some
> >transformations before loading data into the database. IIRC, when we
> >switched to using C we saw 100x improvement in speed, so I suspect that
> >if you want performance perl isn't the way to go. I think you can
> >compile perl into C, so maybe that would help some.

>
> I use Perl extensively, and have never seen a performance problem. I
> suspect the perl-to-C "100x improvement" was due to some other factor, like
> a slight change in the schema, indexes, or the fundamental way the client
> (C vs Perl) handled the data during the transformation, or just plain bad
> Perl code.
>
> Modern scripting languages like Perl and Python make programmers far, far
> more productive than the bad old days of C/C++. Don't shoot yourself in
> the foot by reverting to low-level languages like C/C++ until you've
> exhausted all other possibilities. I only use C/C++ for intricate
> scientific algorithms.
>
> In many cases, Perl is *faster* than C/C++ code that I write, because I
> can't take the time (for example) to write the high-performance string
> manipulation that have been fine-tuned and extensively optimized in Perl.


Well, the code is all at
http://cvs.distributed.net/viewcvs.c...s-proc/hourly/ (see logmod
directory and logmod_*.pl). There have been changes made to the C code
since we changed over, but you can find the appropriate older versions
in there. IIRC, nothing in the database changed when we went from perl
to C (it's likely that was the *only* change that happened anywhere
around that time).
--
Jim Nasby jim@nasby.net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #10 (permalink)  
Old 04-19-2008, 09:41 AM
Jim C. Nasby
 
Posts: n/a
Default Re: Best COPY Performance

On Tue, Oct 24, 2006 at 09:17:08AM -0400, Worky Workerson wrote:
> >http://stats.distributed.net used to use a perl script to do some
> >transformations before loading data into the database. IIRC, when we
> >switched to using C we saw 100x improvement in speed, so I suspect that
> >if you want performance perl isn't the way to go. I think you can
> >compile perl into C, so maybe that would help some.

>
> Like Craig mentioned, I have never seen those sorts of improvements
> going from perl->C, and developer efficiency is primo for me. I've
> profiled most of the stuff, and have used XS modules and Inline::C on
> the appropriate, often used functions, but I still think that it comes
> down to my using CSV and Text::CSV_XS. Even though its XS, CSV is
> still a pain in the ass.
>
> >Ultimately, you might be best of using triggers instead of rules for the
> >partitioning since then you could use copy. Or go to raw insert commands
> >that are wrapped in a transaction.

>
> Eh, I've put the partition loading logic in the loader, which seems to
> work out pretty well, especially since I keep things sorted and am the
> only one inserting into the DB and do so with bulk loads. But I'll
> keep this in mind for later use.


Well, given that perl is using an entire CPU, it sounds like you should
start looking either at ways to remove some of the overhead from perl,
or to split that perl into multiple processes.
--
Jim Nasby jim@nasby.net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 04:37 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Search Engine Optimization by vBSEO 3.2.0
www.UnixAdminTalk.com