Unix Technical Forum

Copy From & Insert UNLESS

This is a discussion on Copy From & Insert UNLESS within the pgsql Hackers forums, part of the PostgreSQL category; --> Greets folks, [YABLP: Yet Another Bulk Loading Proposal] The subject of this letter is referring to giving INSERT and ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > pgsql Hackers

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-11-2008, 07:54 AM
James William Pye
 
Posts: n/a
Default Copy From & Insert UNLESS

Greets folks, [YABLP: Yet Another Bulk Loading Proposal]

The subject of this letter is referring to giving INSERT and COPY FROM STDIN
the ability to alter the destination of rows that violate any constraints named
in a user specified set.

I am seeking, as many others are or have, to improve the performance on bulk
loads to live systems where constraint violations may occur and filtering can
be done more efficiently within the backend.

Primarily, I'm concerned with UNIQUE violations. However, I think tackling the
general case is the wiser action.

The attached patch is *not* being included for any sort of application, and I
make no claims of it functioning as I intended it to or as I may imply it to. =)
The patch only served the purpose of providing rough numbers for the case of
unique violations.

Despite the fact that my experimental patch uses error trapping, that is *not*
what I have in mind for the implementation. I do not want to trap errors upon
insert or copy from. Rather, I wish to implement functionality that would allow
alternate destinations for tuples that violate user specified constraints on
the table, which, by default, will be to simply drop the tuple.

My proposed syntax is along the lines of:

INSERT INTO table [ ( column [, ...] ) ]
* [UNLESS CONSTRAINT VIOLATION
[ON (constraint_name [, ...]) [THEN INSERT INTO table]] [, OR ...]]
{ DEFAULT VALUES | VALUES ( { expression | DEFAULT } [, ...] ) | query }

and

COPY tablename [ ( column [, ...] ) ]
FROM { 'filename' | STDIN }
* [UNLESS CONSTRAINT VIOLATION
[ON (constraint_name [, ...]) [THEN INSERT INTO table]] [, OR ...]]
...

The purpose of the arguably unintuitive THEN INSERT INTO x portion is to provide
the mechanism in which a user can specify the destination table for tuples that
violated the associated set of constraints. Using the OR portion allows the user
to specify additional sets of constraints for different destinations.

A tuple will be withheld from the target table if ANY of the constraints
listed in any of the constraint_name sets is violated. Constraint sets should
not [may not?] reference the same constraint multiple times, even among
different sets.

Example:

\d dest_table
Table "public.dest_table"
Column | Type | Modifiers
--------+---------+-----------
i | integer | not null
j | integer |
Indexes:
"dest_table_pkey" PRIMARY KEY, btree (i)
Check constraints:
"dest_table_j_check" CHECK (j > 0)

CREATE TEMP TABLE pkey_failures (i int, j int);
CREATE TEMP TABLE check_failures (i int, j int);

COPY dest_table FROM STDIN
UNLESS CONSTRAINT VIOLATION
ON (dest_table_pkey) THEN INSERT INTO pkey_failures
OR (dest_table_j_check) THEN INSERT INTO check_failures;

For most constraints, this proposed implementation should be fairly easy to
implement. However, the B-Tree index has the uniqueness check within its
insert access method, _bt_check_unique. Perhaps the best solution here is to
expose this check function--with changes, of course--and define a new access
method entry, 'amcheck' or, perhaps, 'amscanforinsert' where state information
would be given back to the caller for later use in the actual insert. (Perhaps
evident, but I'm not as familiar with the index code as I would like to be for
this sort of speculation, so please excuse me if I am not making good sense.)

There is one other annoying change. The constraints specified in an INSERT
UNLESS should be checked before all other unspecified constraints. This is to
elegantly handle the insertion case where two violations can occur, one with a
constraint that the user specified, and one that the user didn't. Regardless of
the order in which constraints are collected for checking, the user specified
ones should be checked first to avoid unwelcome errors from being thrown when
the tuple was going to be tossed anyways.

This proposal purposefully does not discuss bad data errors as I think that
should be seen as a separate issue. Perhaps a future feature within the UNLESS
syntax.


Prior Discussions or Mentions [See the last two.]

implicit abort harmful?
http://archives.postgresql.org/pgsql...5/msg00962.php
how to continue a transaction after an error?
http://archives.postgresql.org/pgsql...1/msg00097.php
mass import to table with unique index
http://archives.postgresql.org/pgsql...1/msg01465.php
Duplicate key insert question
http://archives.postgresql.org/pgsql...7/msg00056.php

Ignore when using COPY FROM (Matthew Kennedy) (small, complaint/fr)
http://archives.postgresql.org/pgsql...8/msg00681.php
COPY and duplicates (Ryan Mahoney) (small, complaint/fr)
http://archives.postgresql.org/pgsql...7/msg00569.php

Bulk loading using COPY - ignore duplicates? (Lee Kindness?)
http://archives.postgresql.org/pgsql...1/msg00029.php
Practical error logging for very large COPY statements (Simon Riggs)
http://archives.postgresql.org/pgsql...1/msg01100.php


Extant Solutions

There are quite a few solutions to this problem as I'm sure many (all?) know:

.. Temporary table that filters out the evil tuples.
.. BEFORE TRIGGER handling the tuple if the constraint of interest is violated.
.. INSERT wrapped in a subtransaction.
.. (Other variations)

Temporary tables are probably the fastest here. However, it still exhibits
redundancy, and requires post-load tuple movement(extra step).

Savepoints require client side logic in order to detect the appropriate
error code to trap or raise. (Also, this seems to be quite slow, regardless.)

A before trigger is going to require that interested constraints be tested
twice and for code to be effectively duplicated.

So, there are quite a few ways to do error controlled bulk loading. Temporary
tables appear to be the best current solution. However, I think the
implementation described in this proposal will yield improvements by simply
reducing redundancy.


Simple Numbers [Using the attached, *very* experimental patch]:

(PostgreSQL 8.2 [~HEAD], redhat 9)

These numbers were rendered from a simple single client trial where psql
and the postmaster were running on the same machine. I ran each trial a
few times and I would cleanup and VACUUM FULL between runs. The numbers do
not include the cleanup or the vacuum.

The "Insert unless" trials are drawn from my attached patch.
(For the INSERT UNLESS cases I used the patch, otherwise a clean PG.)

CREATE TABLE foo (i int PRIMARY KEY);

[Straight SQL INSERTs (via psql and an SQL file)]
(100000 Violations)
Auto-Commit Transactions: 22.213
Savepoints: (ran out of shared memory)
Insert into temporary table, Merge: 24.225
* Insert Unless: 14.668

(50000 Violations, 50000 New)
Auto-Commit Transactions: 33.342
Savepoints: (untested)
Insert into temporary table, Merge: 24.243
* Insert Unless: 14.260

(100000 New)
Auto-Commit Transactions: 47.990
Savepoints: 3:05.60 (three minutes!!)
Temporary table: 26.178
* Insert Unless: 14.283

The numbers here look pretty good, especially for such a hackish patch.
[btw, I hope I screwed up somehow on the savepoints.]

However:

[COPY FROM STDIN (via psql and a file of data)]
(100000 Violations)
Copy Unless: 2.4132
Copy to temporary, Merge: 0.72675

(50000 Conflicts, 50000 New)
Copy Unless: 2.1145
Copy to temporary, Merge: 1.469

(100000 New)
Copy Unless: 1.6386
Copy to temporary, Merge: 2.4305

The numbers here don't look so good now.
However, I'm convinced that this is showing the inefficiencies in my current
hack, rather than dissolving the likelihood of the discussed implementation being
an improvement.

(The files that rendered these results are available on request. They are 100K a
piece after being bzip'd)
--
Regards, James William Pye


---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-11-2008, 07:55 AM
Josh Berkus
 
Posts: n/a
Default Re: Copy From & Insert UNLESS

James,

> I am seeking, as many others are or have, to improve the performance on bulk
> loads to live systems where constraint violations may occur and filtering can
> be done more efficiently within the backend.
>
> Primarily, I'm concerned with UNIQUE violations. However, I think tackling the
> general case is the wiser action.


Alon Goldshuv on Bizgres has been working on this as well. Maybe you
could collaborate? Alon?

--Josh

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-11-2008, 07:55 AM
Josh Berkus
 
Posts: n/a
Default Re: Copy From & Insert UNLESS

Folks,

> The subject of this letter is referring to giving INSERT and COPY FROM STDIN
> the ability to alter the destination of rows that violate any constraints named
> in a user specified set.


BTW, just in case anyone thinks that James is not addressing a real and
widespread problem, Joe Conway said in his presentation on Symer's 1.2
TB databases that the single most painful thing they had to deal with in
the implementation is filtering out bad rows before COPY (from OSCON2005
presentation).

--Josh

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 04-11-2008, 07:55 AM
Stephan Szabo
 
Posts: n/a
Default Re: Copy From & Insert UNLESS


On Fri, 3 Feb 2006, James William Pye wrote:

> Despite the fact that my experimental patch uses error trapping, that is *not*
> what I have in mind for the implementation. I do not want to trap errors upon
> insert or copy from. Rather, I wish to implement functionality that would allow
> alternate destinations for tuples that violate user specified constraints on
> the table, which, by default, will be to simply drop the tuple.
>
> My proposed syntax is along the lines of:
>
> INSERT INTO table [ ( column [, ...] ) ]
> * [UNLESS CONSTRAINT VIOLATION
> [ON (constraint_name [, ...]) [THEN INSERT INTO table]] [, OR ...]]
> { DEFAULT VALUES | VALUES ( { expression | DEFAULT } [, ...] ) | query }
>
> and
>
> COPY tablename [ ( column [, ...] ) ]
> FROM { 'filename' | STDIN }
> * [UNLESS CONSTRAINT VIOLATION
> [ON (constraint_name [, ...]) [THEN INSERT INTO table]] [, OR ...]]
> ...
>
> The purpose of the arguably unintuitive THEN INSERT INTO x portion is to provide
> the mechanism in which a user can specify the destination table for tuples that
> violated the associated set of constraints. Using the OR portion allows the user
> to specify additional sets of constraints for different destinations.
>
> A tuple will be withheld from the target table if ANY of the constraints
> listed in any of the constraint_name sets is violated. Constraint sets should
> not [may not?] reference the same constraint multiple times, even among
> different sets.
>
> Example:
>
> \d dest_table
> Table "public.dest_table"
> Column | Type | Modifiers
> --------+---------+-----------
> i | integer | not null
> j | integer |
> Indexes:
> "dest_table_pkey" PRIMARY KEY, btree (i)
> Check constraints:
> "dest_table_j_check" CHECK (j > 0)
>
> CREATE TEMP TABLE pkey_failures (i int, j int);
> CREATE TEMP TABLE check_failures (i int, j int);
>
> COPY dest_table FROM STDIN
> UNLESS CONSTRAINT VIOLATION
> ON (dest_table_pkey) THEN INSERT INTO pkey_failures
> OR (dest_table_j_check) THEN INSERT INTO check_failures;
>
> For most constraints, this proposed implementation should be fairly easy to
> implement.


Have you considered how this might work with spec-compliant constraint
timing? I think even in inserting cases, a later trigger before statement
end could in some cases un-violate a constraint, so checking before insert
won't actually be the same behavior as the normal constraint handling
which seems bad for this kind of system.

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 04-11-2008, 07:55 AM
James William Pye
 
Posts: n/a
Default Re: Copy From & Insert UNLESS

On Sun, Feb 05, 2006 at 02:08:12PM -0800, Stephan Szabo wrote:
> Have you considered how this might work with spec-compliant constraint
> timing?


I haven't gone so far as to look into the spec, yet. [Noise of rustling papers]

However, constraints referenced in an UNLESS clause that are deferred, in any
fashion, should probably be "immediated" within the context of the command.
Perhaps a WARNING or NOTICE would be appropriately informative if UNLESS were
to actually alter the timing of a given constraint.

> I think even in inserting cases, a later trigger before statement
> end could in some cases un-violate a constraint, so checking before insert
> won't actually be the same behavior as the normal constraint handling
> which seems bad for this kind of system.


Any facility that can alter the tuple before it being inserted into the heap
should probably be exercised prior to the application of the tuple against
UNLESS's behavior. The implementation of UNLESS will probably completely change
ExecConstraints(), which comes after the firing of BEFORE triggers and before
heap_insert(). Beyond that, I am not sure what other considerations should be
made with respect to triggers. So, UNLESS should/will be applied after BEFORE
triggers, but before "non-UNLESS specified" constraints.
--
Regards, James William Pye

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #6 (permalink)  
Old 04-11-2008, 07:55 AM
Stephan Szabo
 
Posts: n/a
Default Re: Copy From & Insert UNLESS


On Sun, 5 Feb 2006, James William Pye wrote:

> On Sun, Feb 05, 2006 at 02:08:12PM -0800, Stephan Szabo wrote:
> > Have you considered how this might work with spec-compliant constraint
> > timing?

>
> I haven't gone so far as to look into the spec, yet. [Noise of rustling papers]
>
> However, constraints referenced in an UNLESS clause that are deferred, in any
> fashion, should probably be "immediated" within the context of the command.
> Perhaps a WARNING or NOTICE would be appropriately informative if UNLESS were
> to actually alter the timing of a given constraint.


The problem is that even immediate constraints are supposed to be checked
at end of statement, not at row time. Our implementation of UNIQUE is
particularly bad for this. Basically a violation at the time the row is
created is irrelevant if the violation is gone by the end of statement.

> > I think even in inserting cases, a later trigger before statement
> > end could in some cases un-violate a constraint, so checking before insert
> > won't actually be the same behavior as the normal constraint handling
> > which seems bad for this kind of system.

>
> Any facility that can alter the tuple before it being inserted into the heap
> should probably be exercised prior to the application of the tuple against
> UNLESS's behavior.


The problem is that you can un-violate a unique constraint by changing
some other row that's already in the table. And I think that it might even
be legal to do so in an after trigger (and in fact, some other row's after
trigger).

This isn't necessarily a killer to the idea though, it probably just means
the semantics are harder to nail down.

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #7 (permalink)  
Old 04-11-2008, 07:55 AM
James William Pye
 
Posts: n/a
Default Re: Copy From & Insert UNLESS

On Sun, Feb 05, 2006 at 07:14:49PM -0800, Stephan Szabo wrote:
> On Sun, 5 Feb 2006, James William Pye wrote:
> > However, constraints referenced in an UNLESS clause that are deferred, in any
> > fashion, should probably be "immediated" within the context of the command.
> > Perhaps a WARNING or NOTICE would be appropriately informative if UNLESS were
> > to actually alter the timing of a given constraint.

>
> The problem is that even immediate constraints are supposed to be checked
> at end of statement, not at row time.


I see. "Immediated" is not the word that I am actually looking for then.
Perhaps Postgres should specify our current immediate as a "new" constraint mode.
"instant", maybe? Sadly, I think it will be difficult to get away from using that or
some other synonym if such an idea were to be "implemented".

[Getting the feeling that this has been discussed before. ;]

> Our implementation of UNIQUE is particularly bad for this.


Yes. Changing how UNIQUE constraints are implemented will likely be the first
step in this patch.

> > Any facility that can alter the tuple before it being inserted into the heap
> > should probably be exercised prior to the application of the tuple against
> > UNLESS's behavior.

>
> The problem is that you can un-violate a unique constraint by changing
> some other row that's already in the table. And I think that it might even
> be legal to do so in an after trigger (and in fact, some other row's after
> trigger).
> [join]
> Basically a violation at the time the row is
> created is irrelevant if the violation is gone by the end of statement.


Okay. I can't help but think such a trigger as being questionable at best.
However, per spec, it should be possible. =\

> This isn't necessarily a killer to the idea though, it probably just means
> the semantics are harder to nail down.


Aye. I figured there would be some details that might take a while.


Once the UNIQUE constraint code is relocated, I think implementing more
standards compliant constraint timing might be substantially easier. However, I
don't think this should effect UNLESS. Rather, I think UNLESS should, more or
less, demand that specified constraints be checked at the same time as they are
currently. This is meant to be an optimization at multiple levels; reduce code
redundancy(rewriting constraint checks for use prior to the actual insertion),
computational redundancy(potentially, running the rewritten checks more than
once), and reduce unnecessary I/O(avoiding heap_insert()'ing an evil tuple
into the target table despite the fact that the statement may later "inviolate"
it). Although, perhaps, it could be configurable with an option;
"INSERT INTO t UNLESS [DEFERRED] CONSTRAINT VIOLATION". =)
--
Regards, James William Pye

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #8 (permalink)  
Old 04-11-2008, 07:55 AM
Josh Berkus
 
Posts: n/a
Default Re: Copy From & Insert UNLESS

James,

Are you sure that a new type of constraint is the way to go for this?
It doesn't solve our issues in the data warehousing space. The spec we
started with for "Error-tolerant COPY" is:

1) It must be able to handle parsing errors (i.e. bad char set);
2) It must be able to handle constraint violations;
3) It must output all row errors to a log or "errors" table which makes
it possible to determine which input row failed and why;
4) It must not slow significantly (like, not more than 15%) the speed of
bulk loading.

On that basis, Alon started working on a low-level error trapper for
COPY. It seems like your idea, which would involve a second constraint
check, would achieve neigher #1 nor #4.

--Josh Berkus

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #9 (permalink)  
Old 04-11-2008, 07:55 AM
Stephan Szabo
 
Posts: n/a
Default Re: Copy From & Insert UNLESS

On Mon, 6 Feb 2006, Josh Berkus wrote:

> Are you sure that a new type of constraint is the way to go for this?
> It doesn't solve our issues in the data warehousing space. The spec we
> started with for "Error-tolerant COPY" is:
>
> 1) It must be able to handle parsing errors (i.e. bad char set);
> 2) It must be able to handle constraint violations;
> 3) It must output all row errors to a log or "errors" table which makes
> it possible to determine which input row failed and why;
> 4) It must not slow significantly (like, not more than 15%) the speed of
> bulk loading.
>
> On that basis, Alon started working on a low-level error trapper for
> COPY. It seems like your idea, which would involve a second constraint
> check, would achieve neigher #1 nor #4.


I think in his system it wouldn't check the constraints twice, it'd just
potentially check them at a different time than the normal constraint
timing, so I think it'd cover #4. I'd wonder if there'd be any possibility
of having violations get unnoticed in that case, but I'm not coming up
with an obvious way that could happen.


---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #10 (permalink)  
Old 04-11-2008, 07:55 AM
Stephan Szabo
 
Posts: n/a
Default Re: Copy From & Insert UNLESS

On Mon, 6 Feb 2006, James William Pye wrote:

> On Sun, Feb 05, 2006 at 07:14:49PM -0800, Stephan Szabo wrote:
> > On Sun, 5 Feb 2006, James William Pye wrote:
> > > However, constraints referenced in an UNLESS clause that are deferred, in any
> > > fashion, should probably be "immediated" within the context of the command.
> > > Perhaps a WARNING or NOTICE would be appropriately informative if UNLESS were
> > > to actually alter the timing of a given constraint.

> >
> > The problem is that even immediate constraints are supposed to be checked
> > at end of statement, not at row time.

>
> I see. "Immediated" is not the word that I am actually looking for then.
> Perhaps Postgres should specify our current immediate as a "new" constraint mode.
> "instant", maybe? Sadly, I think it will be difficult to get away from using that or
> some other synonym if such an idea were to be "implemented".
>
> [Getting the feeling that this has been discussed before. ;]


Only parts of it.

> > Our implementation of UNIQUE is particularly bad for this.

>
> Yes. Changing how UNIQUE constraints are implemented will likely be the first
> step in this patch.
>
> > > Any facility that can alter the tuple before it being inserted into the heap
> > > should probably be exercised prior to the application of the tuple against
> > > UNLESS's behavior.

> >
> > The problem is that you can un-violate a unique constraint by changing
> > some other row that's already in the table. And I think that it might even
> > be legal to do so in an after trigger (and in fact, some other row's after
> > trigger).
> > [join]
> > Basically a violation at the time the row is
> > created is irrelevant if the violation is gone by the end of statement.

>
> Okay. I can't help but think such a trigger as being questionable at best.
> However, per spec, it should be possible. =\


Yeah, it's pretty odd in the insert case. It's easy in the update case to
make a case where it matters, definately less so for insert.

> Once the UNIQUE constraint code is relocated, I think implementing more
> standards compliant constraint timing might be substantially easier. However, I
> don't think this should effect UNLESS. Rather, I think UNLESS should, more or
> less, demand that specified constraints be checked at the same time as they are
> currently. This is meant to be an optimization at multiple levels; reduce code
> redundancy(rewriting constraint checks for use prior to the actual insertion),
> computational redundancy(potentially, running the rewritten checks more than
> once), and reduce unnecessary I/O(avoiding heap_insert()'ing an evil tuple
> into the target table despite the fact that the statement may later "inviolate"
> it). Although, perhaps, it could be configurable with an option;
> "INSERT INTO t UNLESS [DEFERRED] CONSTRAINT VIOLATION". =)


I'd say that if we were going to check the constraints at a different
time, we'd want a better name/description than "UNLESS CONSTRAINT
VIOLATION" since the unadorned INSERT or COPY might run with no constraint
violations.

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 10:41 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Search Engine Optimization by vBSEO 3.2.0
www.UnixAdminTalk.com