This is a discussion on Best COPY Performance within the Pgsql Performance forums, part of the PostgreSQL category; --> What is the best COPY performance that you have gotten on a "normal" table? I know that this is ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| What is the best COPY performance that you have gotten on a "normal" table? I know that this is question is almost too general, but it might help me out a bit, or at least give me the right things to tweak. Perhaps the question can be rewritten as "Where are the major bottlenecks in a COPY?" or "How can I compute the max theoretical COPY performance for my hardware?". The two subquestions that I have from this are: -Are my ETL scripts (perl) maximizing the database COPY speeds? -Can I tweak my DB further to eek out a bit more performance? I'm using perl to ETL a decent sized data set (10 million records) and then loading it through perl: between 10K and 15K inserts/second. I've profiled the ETL scripts a bit and have performance-improved a lot of the code, but I'd like to determine whether it makes sense to try and further optimize my Perl or count it as "done" and look for improvements elsewhere. I ran trivial little insert into a table with a single integer row and came close to 250K inserts/second using psql's \copy, so I'm thinking that my code could be optimized a bit more, but wanted to check around to see if that was the case. I am most interested in loading two tables, one with about 21 (small) VARCHARs where each record is about 200 bytes, and another with 7 INTEGERs, 3 TIMESTAMPs, and 1 BYTEA where each record is about 350 bytes. I have implemented most of the various bits of PG config advice that I have seen, both here and with a some googling, such as: wal_buffers=128 checkpoint_segments=128 checkpoint_timeout=3000 Software: PG 8.1.3 on RHEL 4.3 x86_64 Hardware: Quad Dual-core Opteron, Fibre Channel SAN with 256M BBC Thanks! ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| On 10/21/06, Worky Workerson <worky.workerson@gmail.com> wrote: > What is the best COPY performance that you have gotten on a "normal" table? > > I know that this is question is almost too general, but it might help > me out a bit, or at least give me the right things to tweak. Perhaps > the question can be rewritten as "Where are the major bottlenecks in a > COPY?" or "How can I compute the max theoretical COPY performance for > my hardware?". The two subquestions that I have from this are: > -Are my ETL scripts (perl) maximizing the database COPY speeds? > -Can I tweak my DB further to eek out a bit more performance? > > I'm using perl to ETL a decent sized data set (10 million records) and > then loading it through perl: > between 10K and 15K inserts/second. I've profiled the ETL scripts a > bit and have performance-improved a lot of the code, but I'd like to > determine whether it makes sense to try and further optimize my Perl > or count it as "done" and look for improvements elsewhere. > > I ran trivial little insert into a table with a single integer row and > came close to 250K inserts/second using psql's \copy, so I'm thinking > that my code could be optimized a bit more, but wanted to check around > to see if that was the case. > > I am most interested in loading two tables, one with about 21 (small) > VARCHARs where each record is about 200 bytes, and another with 7 > INTEGERs, 3 TIMESTAMPs, and 1 BYTEA where each record is about 350 > bytes. indexes/keys? more memory for sorting during index creation can have a dramatic affect on bulk insert performance. check for pg_tmp folders popping up during copy run. > I have implemented most of the various bits of PG config advice that I > have seen, both here and with a some googling, such as: > > wal_buffers=128 > checkpoint_segments=128 > checkpoint_timeout=3000 > > Software: PG 8.1.3 on RHEL 4.3 x86_64 > Hardware: Quad Dual-core Opteron, Fibre Channel SAN with 256M BBC for table light on indexes, 10-15k for copy is pretty poor. you can get pretty close to that with raw inserts on good hardware. I would suggest configuirng your perl script to read from stdin and write to stdout, and pipe it to psql using copy from stdin. then just benchmark your perl script redirecting output to a file. merlin ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| |||
| Hi, Worky, Worky Workerson wrote: > I am currently getting > between 10K and 15K inserts/second. > I ran trivial little insert into a table with a single integer row and > came close to 250K inserts/second using psql's \copy, so I'm thinking > that my code could be optimized a bit more, but wanted to check around > to see if that was the case. Could you COPY one of your tables out to disk via psql, and then COPY it back into the database, to reproduce this measurement with your real data? Also, how much is the disk load, and CPU usage? As long as psql is factor 20 better than your perl script, I think that the perl interface is what should be optimized. On a table with no indices, triggers and contstraints, we managed to COPY about 7-8 megabytes/second with psql over our 100 MBit network, so here the network was the bottleneck. You should think about making your perl program writing the COPY statement as text, and piping it into psql. HTH, Markus -- Markus Schaber | Logical Tracking&Tracing International AG Dipl. Inf. | Software Development GIS Fight against software patents in Europe! www.ffii.org www.nosoftwarepatents.org ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| |||
| Markus, On 10/23/06 2:27 AM, "Markus Schaber" <schabi@logix-tt.com> wrote: > On a table with no indices, triggers and contstraints, we managed to > COPY about 7-8 megabytes/second with psql over our 100 MBit network, so > here the network was the bottleneck. We routinely get 10-12MB/s on I/O hardware that can sustain a sequential write rate of 60+ MB/s with the WAL and data on the same disks. It depends on a few things you might not consider, including the number and type of columns in the table and the client and server encoding. The fastest results are with more columns in a table and when the client and server encoding are the same. - Luke ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster |
| |||
| > > I am most interested in loading two tables, one with about 21 (small) > > VARCHARs where each record is about 200 bytes, and another with 7 > > INTEGERs, 3 TIMESTAMPs, and 1 BYTEA where each record is about 350 > > bytes. > > indexes/keys? more memory for sorting during index creation can have > a dramatic affect on bulk insert performance. check for pg_tmp > folders popping up during copy run. The only index on load is a single IP4 btree primary key, which I figure should function about like an INTEGER. > for table light on indexes, 10-15k for copy is pretty poor. you can > get pretty close to that with raw inserts on good hardware. I would > suggest configuirng your perl script to read from stdin and write to > stdout, and pipe it to psql using copy from stdin. then just > benchmark your perl script redirecting output to a file. So simple and hadn't thought of that ... thanks. When I pre-create a COPY file, I can load it at about 45K inserts/sec (file was 1.8GB or 14.5 million records in 331 seconds), which looks like its about 5.5 MB/s. I'm loading from a local 15K SCSI320 RAID10 (which also contains the PG log files) to a 10K SCSI320 RAID10 on an FC SAN. Does this look more consistent with "decent" performance, or should I go looking into some hardware issues i.e. SAN configuration? I've currently got several hats including hardware/systems/security admin, as well as DBA and programmer, and my SAN setup skills could definitely use some more work. Hardware aside, my perl can definitely use some work, and it seems to be mostly the CSV stuff that I am using, mostly for convenience. I'll see if I can't redo some of that to eliminate some CSV processing, or, barring that, multithread the process to utilize more of the CPUs. Part of the reason that I hadn't used psql in the first place is that I'm loading the data into partitioned tables, and the loader keeps several COPY connections open at a time to load the data into the right table. I guess I could just as easily keep several psql pipes open, but it seemed cleaner to go through DBI. ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| > Ultimately, you might be best of using triggers instead of rules for the > partitioning since then you could use copy. Or go to raw insert commands > that are wrapped in a transaction. My experience is that triggers are quite a bit faster than rules in any kind of partitioning that involves more than say 7 tables. Sincerely, Joshua D. Drake -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster |
| |||
| Jim C. Nasby wrote: > http://stats.distributed.net used to use a perl script to do some > transformations before loading data into the database. IIRC, when we > switched to using C we saw 100x improvement in speed, so I suspect that > if you want performance perl isn't the way to go. I think you can > compile perl into C, so maybe that would help some. I use Perl extensively, and have never seen a performance problem. I suspect the perl-to-C "100x improvement" was due to some other factor, like a slight change in the schema, indexes, or the fundamental way the client (C vs Perl) handled the data during the transformation, or just plain bad Perl code. Modern scripting languages like Perl and Python make programmers far, far more productive than the bad old days of C/C++. Don't shoot yourself in the foot by reverting to low-level languages like C/C++ until you've exhausted all other possibilities. I only use C/C++ for intricate scientific algorithms. In many cases, Perl is *faster* than C/C++ code that I write, because I can't take the time (for example) to write the high-performance string manipulation that have been fine-tuned and extensively optimized in Perl. Craig ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |
| |||
| Markus, > Could you COPY one of your tables out to disk via psql, and then COPY it > back into the database, to reproduce this measurement with your real data? $ psql -c "COPY my_table TO STDOUT" > my_data $ ls my_data 2018792 edgescape_pg_load $ time cat my_data | psql -c "COPY mytable FROM STDIN" real 5m43.194s user 0m35.412s sys 0m9.567s > Also, how much is the disk load, and CPU usage? When I am loading via the perl (which I've established is a bottleneck), the one CPU core is at 99% for the perl and another is at 30% for a postmaster, vs about 90% for the postmaster when going through psql. The disk load is where I start to get a little fuzzy, as I haven't played with iostat to figure what is "normal". The local drives contain PG_DATA as well as all the log files, but there is a tablespace on the FibreChannel SAN that contains the destination table. The disk usage pattern that I see is that there is a ton of consistent activity on the local disk, with iostat reporting an average of 30K Blk_wrtn/s, which I assume is the log files. Every several seconds there is a massive burst of activity on the FC partition, to the tune of 250K Blk_wrtn/s. > On a table with no indices, triggers and contstraints, we managed to > COPY about 7-8 megabytes/second with psql over our 100 MBit network, so > here the network was the bottleneck. hmm, this makes me think that either my PG config is really lacking, or that the SAN is badly misconfigured, as I would expect it to outperform a 100Mb network. As it is, with a straight pipe to psql COPY, I'm only working with a little over 5.5 MB/s. Could this be due to the primary key index updates? Thanks! ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| On Mon, Oct 23, 2006 at 03:37:47PM -0700, Craig A. James wrote: > Jim C. Nasby wrote: > >http://stats.distributed.net used to use a perl script to do some > >transformations before loading data into the database. IIRC, when we > >switched to using C we saw 100x improvement in speed, so I suspect that > >if you want performance perl isn't the way to go. I think you can > >compile perl into C, so maybe that would help some. > > I use Perl extensively, and have never seen a performance problem. I > suspect the perl-to-C "100x improvement" was due to some other factor, like > a slight change in the schema, indexes, or the fundamental way the client > (C vs Perl) handled the data during the transformation, or just plain bad > Perl code. > > Modern scripting languages like Perl and Python make programmers far, far > more productive than the bad old days of C/C++. Don't shoot yourself in > the foot by reverting to low-level languages like C/C++ until you've > exhausted all other possibilities. I only use C/C++ for intricate > scientific algorithms. > > In many cases, Perl is *faster* than C/C++ code that I write, because I > can't take the time (for example) to write the high-performance string > manipulation that have been fine-tuned and extensively optimized in Perl. Well, the code is all at http://cvs.distributed.net/viewcvs.c...s-proc/hourly/ (see logmod directory and logmod_*.pl). There have been changes made to the C code since we changed over, but you can find the appropriate older versions in there. IIRC, nothing in the database changed when we went from perl to C (it's likely that was the *only* change that happened anywhere around that time). -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell) ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| ||||
| On Tue, Oct 24, 2006 at 09:17:08AM -0400, Worky Workerson wrote: > >http://stats.distributed.net used to use a perl script to do some > >transformations before loading data into the database. IIRC, when we > >switched to using C we saw 100x improvement in speed, so I suspect that > >if you want performance perl isn't the way to go. I think you can > >compile perl into C, so maybe that would help some. > > Like Craig mentioned, I have never seen those sorts of improvements > going from perl->C, and developer efficiency is primo for me. I've > profiled most of the stuff, and have used XS modules and Inline::C on > the appropriate, often used functions, but I still think that it comes > down to my using CSV and Text::CSV_XS. Even though its XS, CSV is > still a pain in the ass. > > >Ultimately, you might be best of using triggers instead of rules for the > >partitioning since then you could use copy. Or go to raw insert commands > >that are wrapped in a transaction. > > Eh, I've put the partition loading logic in the loader, which seems to > work out pretty well, especially since I keep things sorted and am the > only one inserting into the DB and do so with bulk loads. But I'll > keep this in mind for later use. Well, given that perl is using an entire CPU, it sounds like you should start looking either at ways to remove some of the overhead from perl, or to split that perl into multiple processes. -- Jim Nasby jim@nasby.net EnterpriseDB http://enterprisedb.com 512.569.9461 (cell) ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings |