Unix Technical Forum

"SMgrRelation hashtable corrupted" failure identified

Old 04-11-2008, 03:14 AM
Tom Lane
 

We've seen a few reports of the above-mentioned error message from
PG 8.0 testers, but up till now no one had come up with a reproducible
test case. I've now found a trivial example:

session 1: create table a1 (f1 varchar(128));
session 2: insert into a1 values('abc');
session 1: alter table a1 alter column f1 type varchar(256);
session 2: insert into a1 values('abcd');
session 2 fails with ERROR: SMgrRelation hashtable corrupted
continued use of session 2 leads to a crash

Many if not all scenarios involving a rewriting ALTER TABLE on a
table in active use by other backends will fail like this.
I believe there are probably similar failures involving CLUSTER,
though a quick try didn't show it. This seems clearly to be a
"must fix for 8.0" bug.

The basic problem is that when ALTER TABLE tries to swap the physical
files associated with the original table and the temp version of the
table, it sends out relcache inval events for all four combinations
of table OID and relfilenode. Because inval.c is a bit cavalier about
the ordering of inval events, the one that session 2 sees first is the
one for <temp table OID, old relfilenode>. It does not find a relcache
entry for the temp table OID, but it does find an smgr table entry for
the relfilenode, which it proceeds to drop. Now there is a dangling
smgr reference in its relcache, so when it next gets hit with a
relcache clear event for the original table OID, boom!
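
To make the sequence above concrete, here is a toy model in C. It is only a
sketch: every type and function name is hypothetical, and the logic is far
simpler than the real relcache and smgr code. It shows how an inval event
keyed by <temp table OID, old relfilenode> misses the relcache but still
drops the smgr entry that the original table's relcache entry points at.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model; none of these names are PostgreSQL's real API. */
typedef struct ToySmgrEntry
{
    unsigned int relfilenode;
    bool         in_use;        /* false once the entry has been dropped */
} ToySmgrEntry;

typedef struct ToyRelcacheEntry
{
    unsigned int  oid;
    ToySmgrEntry *smgr;         /* plain pointer, no back link */
} ToyRelcacheEntry;

/* Process one inval event for the pair <oid, relfilenode>. */
static void
toy_process_inval(ToyRelcacheEntry *cache, size_t ncache,
                  ToySmgrEntry *smgr, size_t nsmgr,
                  unsigned int oid, unsigned int relfilenode)
{
    size_t i;

    /* Relcache side: unhook only if an entry exists for this OID. */
    for (i = 0; i < ncache; i++)
        if (cache[i].oid == oid)
            cache[i].smgr = NULL;

    /* smgr side: drop any entry for this relfilenode, whoever points at it. */
    for (i = 0; i < nsmgr; i++)
        if (smgr[i].in_use && smgr[i].relfilenode == relfilenode)
            smgr[i].in_use = false;
}

/* True if the relcache entry still points at a dropped smgr entry. */
static bool
toy_is_dangling(const ToyRelcacheEntry *rel)
{
    return rel->smgr != NULL && !rel->smgr->in_use;
}
```

With one relcache entry for the original table OID pointing at the smgr
entry for the old relfilenode, calling toy_process_inval with the temp
table's OID and the old relfilenode leaves toy_is_dangling true for the
original table's entry.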

I fooled around with trying to patch this by enforcing the "right"
processing order of inval events, but that doesn't work (it just moves
the failure into the sending backend, which it turns out would need
a different processing order to avoid crashing). It would be a horribly
fragile solution anyway.

I now think that the only reasonable fix is to directly attack the
problem of dangling relcache references to smgr table entries. What we
can do is add the concept of an "owning pointer" to an smgr entry, that
is, an "SMgrRelation *myowner" field, and have smgrclose do
something like
    if (reln->myowner)
        *(reln->myowner) = NULL;
For smgr table entries associated with a relcache entry, the relcache
code would set this field as a back link to its rel->rd_smgr pointer.
With this setup, an smgr-level clear would correctly unhook from the
relcache even if the clear did not come directly through the relcache.
This would simplify RelationCacheInvalidateEntry and
LocalExecuteInvalidationMessage, which could then treat relcache clear
and smgr clear as independent operations.
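
A minimal sketch of the owning-pointer idea in C follows. The struct layouts
and the sketch_* function names are assumptions made for illustration, not
the actual PostgreSQL definitions; only the myowner field, the rd_smgr
pointer, and the two-line check in the close routine come from the proposal
above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the real structures. */
typedef struct SMgrRelationData *SMgrRelation;

struct SMgrRelationData
{
    unsigned int  relfilenode;  /* identifies the physical file */
    SMgrRelation *myowner;      /* back link to the owning pointer, or NULL */
};

typedef struct RelationData
{
    unsigned int  rd_id;        /* table OID */
    SMgrRelation  rd_smgr;      /* cached smgr entry, or NULL */
} RelationData;

/* Create an smgr entry that has no owner yet. */
static SMgrRelation
sketch_smgropen(unsigned int relfilenode)
{
    SMgrRelation reln = malloc(sizeof(*reln));

    if (reln == NULL)
        abort();
    reln->relfilenode = relfilenode;
    reln->myowner = NULL;
    return reln;
}

/* Record that *owner is the long-lived pointer to reln. */
static void
sketch_smgrsetowner(SMgrRelation *owner, SMgrRelation reln)
{
    reln->myowner = owner;
    *owner = reln;
}

/* Close the entry, clearing the owner's pointer first so it cannot dangle. */
static void
sketch_smgrclose(SMgrRelation reln)
{
    if (reln->myowner)
        *(reln->myowner) = NULL;
    free(reln);
}
```

In this sketch the relcache side would call sketch_smgrsetowner(&rel->rd_smgr,
reln) when it caches the entry. After that, a close arriving by any path
resets rel->rd_smgr to NULL, so the relcache can test the pointer and reopen
the file instead of following a stale reference.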

Comments?

regards, tom lane

Old 04-11-2008, 03:14 AM
Tom Lane
 

"Marc G. Fournier" <scrappy@postgresql.org> writes:
> On Mon, 10 Jan 2005, Tom Lane wrote:
>> Comments?
>
> Only: Josh, put a hold on those press releases, looks like an RC5 is
> forthcoming ...

I knew you were going to say that ;-)

I'm not sure if we should insist on an RC5 for this or not. If we'd
found it after release we'd have stuck it into 8.0.1 without any special
extra testing.

regards, tom lane
