This is a discussion on NOLOGGING option, or ? within the pgsql Hackers forums, part of the PostgreSQL category; --> Bruce Momjian <pgman@candle.pha.pa.us> writes: > One idea would be to look at the table file size first. If it ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| Bruce Momjian <pgman@candle.pha.pa.us> writes: > One idea would be to look at the table file size first. If it has zero > blocks, lock the table and if it still has zero blocks, do the no-WAL > copy. I think that's a bad idea. It would make the behavior unpredictable --- sometimes a COPY will take an exclusive lock, and other times not; and the reason why is at a lower semantic level than the user is supposed to know about. Before you say "this is not important", consider the nontrivial risk that the stronger lock will cause a deadlock failure. I don't think that it's acceptable for lock strength to be unpredictable. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| "Luke Lonergan" <llonergan@greenplum.com> writes: > One of the reasons to consider a LOAD DATA command is that we can isolate > the need for performance improvements and special syntax from the concerns > of preserving the legacy behavior of COPY for use as the primary mechanism > for DUMP and RESTORE. .... and instead, define some new behavior that will soon be considered broken legacy code itself? There isn't any demand for changing the semantics of COPY, as far as I've noticed. If we can make it faster with the same semantics that's great, but I'm not in favor of inventing an alternate that does almost the same thing but (eg) breaks backslash handling in the name of speed. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| On Wed, 2005-06-01 at 10:35 -0700, Alon Goldshuv wrote: > I have been working on improving the COPY command performance > Around 40% for 15 column (mixed types) table. > Around 90% for 1 column table. Thats very cool. > 2) A modified command syntax for introducing a direct single row error > handling. By direct I mean - a row that if rejected from within the COPY > command context does not throw an error and rollsback the whole transaction. > Instead the error is caught and recorded elsewhere, maybe in some error > table, with some more information that can later on be retrieved. The > following rows continue to be processed. This way there is barely any error > handling overhead. Having a recursive row isolation into smaller batches is > extremely expensive for non-small data sets. It's not an option for serious > users. Can we call this the ERRORTABLE clause? > 5) allow an ERRORLIMIT to allow control of aborting a load after a certain > number of errors (and a pre-requisite for this is point number 2 above). The default for which would be ERRORLIMIT 0 to give backwards compatibility. 2) and 5) seem critical for combined usability & performance with real world data. I'm not clear from all of those options whether we still need a LOAD command, based upon other issues/comments raised on this thread. However, there are some other arguments for why it might be a good idea to have a LOAD DATA command separate from COPY. Certainly long term features would be easier to add with two commands. Trying to maintain backwards compatibility just because we use COPY seems like an uphill struggle and is going to mean we have to handle sensible new additions as options so we don't break existing applications. The most important one is the lock type held. [Oracle compatibility isn't one of them, even if it did provide the command name.] But things will be clearer when we see the patch. Best Regards, Simon Riggs ---------------------------(end of broadcast)--------------------------- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) |
| |||
| On Wed, 2005-06-01 at 18:32 -0400, Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > One idea would be to look at the table file size first. If it has zero > > blocks, lock the table and if it still has zero blocks, do the no-WAL > > copy. > > I think that's a bad idea. It would make the behavior unpredictable > --- sometimes a COPY will take an exclusive lock, and other times not; > and the reason why is at a lower semantic level than the user is > supposed to know about. > > Before you say "this is not important", consider the nontrivial risk > that the stronger lock will cause a deadlock failure. I don't think > that it's acceptable for lock strength to be unpredictable. While I agree with Bruce's sentiment, the locking is an issue with COPY. It is particularly important to be able to plan the concurrency within an application. That is why DB2's interesting ability to perform lock upgrades in a relatively unpredictable manner is not good. For CREATE TABLE AS SELECT, I think we could do this Bruce's way and win, since we would take the same grade of lock in both cases. Best Regards, Simon Riggs ---------------------------(end of broadcast)--------------------------- TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org |
| |||
| Luke Lonergan wrote: > Bruce, > > > The problem with a new command is that it becomes unclear when you > > should use COPY and when LOAD DATA, and it confuses users, and has > > maintenance overhead. If Bizgres wants a new command name, go for it, > > but it is unlikely that the community release is going to go in that > > direction, unless there is a fundamental agreement that COPY is broken > > and needs a major revamp, and I have heard no talk of that. > > The question of whether COPY should be improved or whether the changes > should take the form of a new command is separate from the question of > whether the performance of the load path in PostgreSQL needs improvement. > > The 90% performance increase (from 12 MB/s to 21 MB/s) that Alon reported > comes from replacing the parsing logic within COPY. I believe that the > parsing logic in COPY is fundamentally broken from a performance > perspective, and may be broken from a functionality perspective WRT embedded > backslashes. COPY works as designed. The idea that some guy we have never heard of is going to appear and rewrite COPY's processing and tell us that the existing code is actually broken seems pretty arrogant to me. If it is broken (meaning doesn't work as designed), please show us facts rather than conjecture. Oh, and the "Our COPY improvements are so fundamental that they deserve a new command name" also has a similar flavor. (Please explain how you handle literal delimiters and nulls with no escape processing.) > One of the reasons to consider a LOAD DATA command is that we can isolate > the need for performance improvements and special syntax from the concerns > of preserving the legacy behavior of COPY for use as the primary mechanism > for DUMP and RESTORE. This seems like a case where GreenPlum's priorities and the community's priorities might not match. There is much more work required on your part if you are going to convince the community it needs a new data loading command, and starting out with the assumption in emails that it is going to be a newly named command isn't the best approach. That is my fundamental point. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073 ---------------------------(end of broadcast)--------------------------- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > One idea would be to look at the table file size first. If it has zero > > blocks, lock the table and if it still has zero blocks, do the no-WAL > > copy. > > I think that's a bad idea. It would make the behavior unpredictable > --- sometimes a COPY will take an exclusive lock, and other times not; > and the reason why is at a lower semantic level than the user is > supposed to know about. > > Before you say "this is not important", consider the nontrivial risk > that the stronger lock will cause a deadlock failure. I don't think > that it's acceptable for lock strength to be unpredictable. Yea, but you are only doing the lock if the table is zero pages. Doesn't that help? Maybe not. I do like the LOCK keyword if we have to use one to enable this functionality, but I am suspecting people will want this functionality in pg_dump output. How do we do that? Just make it the default for pg_dump output? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073 ---------------------------(end of broadcast)--------------------------- TIP 8: explain analyze is your friend |
| |||
| Simon Riggs wrote: > I'm not clear from all of those options whether we still need a LOAD > command, based upon other issues/comments raised on this thread. > > However, there are some other arguments for why it might be a good idea > to have a LOAD DATA command separate from COPY. Certainly long term > features would be easier to add with two commands. Trying to maintain > backwards compatibility just because we use COPY seems like an uphill > struggle and is going to mean we have to handle sensible new additions > as options so we don't break existing applications. The most important > one is the lock type held. Well, we have had a pretty much unmodified COPY format since like the Berkeley days (I added \N and \.). Please tell us exactly what you want do to that requires a format change, and we can talk about it, but showing up with no proof and expecting a new command is the _wrong_ approach. It actually reminds me of the "our company developed it so it must be great" approach, which doesn't work well in the community. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073 ---------------------------(end of broadcast)--------------------------- TIP 6: Have you searched our list archives? http://archives.postgresql.org |
| |||
| Simon Riggs wrote: > On Wed, 2005-06-01 at 10:35 -0700, Alon Goldshuv wrote: >>2) A modified command syntax for introducing a direct single row error >>handling. By direct I mean - a row that if rejected from within the COPY >>5) allow an ERRORLIMIT to allow control of aborting a load after a certain >>number of errors (and a pre-requisite for this is point number 2 above). > 2) and 5) seem critical for combined usability & performance with real > world data. I'll second that! This would be a huge win for one of my real world applications, whether implemented as a new command, or as added capability on top of COPY. The other performance enhancements would certainly be nice to have also, but in my experience not nearly as important as these two. Joe ---------------------------(end of broadcast)--------------------------- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| Bruce, The patch is not there to show that something is "broken" is it there to show how things could be done in another way, which may or may not be more desireable. > COPY works as designed. The idea that some guy we have never heard of > is going to appear and rewrite COPY's processing and tell us that the > existing code is actually broken seems pretty arrogant to me. If it is > broken (meaning doesn't work as designed), please show us facts rather > than conjecture. I am sure that the code works as designed. In my previous email I was referring to the fact that using COPY with a data field that happens to have backslashes in it, and then querying this data field you will get different results. For example do COPY of a field "c:\\one\ten\nine" and query for it. The result will be much different. This is a problem with clickstream data for example. That's all. It's very possible that there are reasons that I missed for why things are the way they are. > (Please explain how you handle literal delimiters and nulls with no > escape processing.) Escape processing is done only for these 2 cases. Thx, Alon. ---------------------------(end of broadcast)--------------------------- TIP 7: don't forget to increase your free space map settings |
| ||||
| Tom, > > ... and instead, define some new behavior that will soon be considered > broken legacy code itself? I'll not argue further about whether to have a separate LOAD command. That's not as important as fixing the performance issues in the data load path in PostgreSQL to me. However, I find it compelling that the underlying problem confronting high performance data loading in the engine is the slow parse code in psql/copy, etc., and not the WAL or other issues. Next on the list are huge gains from removing things like repetitive calls to strlen(TZ) in the attribute conversion code. > There isn't any demand for changing the semantics of COPY, as far as > I've noticed. If we can make it faster with the same semantics that's > great, but I'm not in favor of inventing an alternate that does almost > the same thing but (eg) breaks backslash handling in the name of speed. A 540% increase in parsing speed while performing substantially the same logic is what the patch accomplishes. The patch is now available on pgsql-patches. In this case the speed did not come from breaking backslash handling, but rather from using faster processing for escape processing and delimiter parsing. We *could* recreate identical escape semantics to COPY at nearly the same speed, but frankly we're puzzled as to why character sequences with "\" automatically imply escape processing. We've found that there are many cases where more sophisticated escape processing options are necessary, including multi-byte delimiters and escape sequences, and the default use of "\" for escaping breaks many data load cases. - Luke ---------------------------(end of broadcast)--------------------------- TIP 8: explain analyze is your friend |