This is a discussion on Re: Corrupt database? 8.1/FreeBSD6.0 within the Pgsql General forums, part of the PostgreSQL category; --> Looking backwards in the logs we see it a few other times this month... (Autovacuum occurring just prior)...same transaction ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| Looking backwards in the logs we see it a few other times this month... (Autovacuum occurring just prior)...same transaction ID How could it be the same transaction ID from several days prior? Jan 2 03:05:04 prod-app-1 postgres[8524]: [4-1] 8524 LOG: autovacuum: processing database "template0" Jan 2 03:05:05 prod-app-1 postgres[8524]: [5-1] 8524 ERROR: could not access status of transaction 1924023481 Jan 2 03:05:05 prod-app-1 postgres[8524]: [5-2] 8524 DETAIL: could not open file "pg_clog/072A": No such file or directory Jeff Amiel <becauseimjeff@yahoo.com> wrote: "PostgreSQL 8.1.2 on i386-portbld-freebsd6.0, compiled by GCC cc (GCC) 3.4.4 [FreeBSD] 20050518" Started seeing this in the logs this afternoon...scaring the begeezus out of me. Jan 11 19:20:19 prod-app-1 postgres[1752]: [5-1] 1752 ERROR: could not access status of transaction 1924023481 Jan 11 19:20:19 prod-app-1 postgres[1752]: [5-2] 1752 DETAIL: could not open file "pg_clog/072A": No such file or directory Jan 11 19:24:35 prod-app-1 postgres[4094]: [5-1] 4094 ERROR: could not access status of transaction 1924023481 Jan 11 19:24:35 prod-app-1 postgres[4094]: [5-2] 4094 DETAIL: could not open file "pg_clog/072A": No such file or directory Jan 11 19:28:35 prod-app-1 postgres[6728]: [5-1] 6728 ERROR: could not access status of transaction 1924023481 Jan 11 19:28:35 prod-app-1 postgres[6728]: [5-2] 6728 DETAIL: could not open file "pg_clog/072A": No such file or directory I could find nothing any transaction refereced in pg_locks...nor could I find a file with that designation (was always the same transaction id) ls -l /db/pg_clog/ total 984 -rw------- 1 pgsql wheel 262144 Jan 11 09:55 07CF -rw------- 1 pgsql wheel 262144 Jan 11 13:45 07D0 -rw------- 1 pgsql wheel 262144 Jan 11 17:00 07D1 -rw------- 1 pgsql wheel 172032 Jan 11 20:39 07D2 At first I thought it was related to a constantly running batch process which I halted, and sure enough, the problem seemed to go away. If I restarted, the problem returned. I then tried select * from on most of the tables used by that process (at least the last days worth) and found no obvious issues or errors. About that time, I noticed that the errors were all IMMEDIATELY preceeded by an autovacuum of template0. So the logs actually looked like this: Jan 11 19:20:19 prod-app-1 postgres[1752]: [4-1] 1752 LOG: autovacuum: processing database "template0" Jan 11 19:20:19 prod-app-1 postgres[1752]: [5-1] 1752 ERROR: could not access status of transaction 1924023481 Jan 11 19:20:19 prod-app-1 postgres[1752]: [5-2] 1752 DETAIL: could not open file "pg_clog/072A": No such file or directory Jan 11 19:24:35 prod-app-1 postgres[4094]: [4-1] 4094 LOG: autovacuum: processing database "template0" Jan 11 19:24:35 prod-app-1 postgres[4094]: [5-1] 4094 ERROR: could not access status of transaction 1924023481 Jan 11 19:24:35 prod-app-1 postgres[4094]: [5-2] 4094 DETAIL: could not open file "pg_clog/072A": No such file or directory Jan 11 19:28:35 prod-app-1 postgres[6728]: [4-1] 6728 LOG: autovacuum: processing database "template0" Jan 11 19:28:35 prod-app-1 postgres[6728]: [5-1] 6728 ERROR: could not access status of transaction 1924023481 Jan 11 19:28:35 prod-app-1 postgres[6728]: [5-2] 6728 DETAIL: could not open file "pg_clog/072A": No such file or directory We've downloaded/compiled pg_filedump, but are stumped as to what relation (or even what database) to start with. I restarted the batch process that I thought was the culprit and the issue has 'gone away'. I'm sure there is some corruption somewhere (from reading various similar posts), but not sure where to start. Any help would be appreciated. --------------------------------- Access over 1 million songs - Yahoo! Music Unlimited. --------------------------------- Check out the all-new Yahoo! Mail beta - Fire up a more powerful email and get things done faster. |
| |||
| Jeff Amiel <becauseimjeff@yahoo.com> writes: > "PostgreSQL 8.1.2 on i386-portbld-freebsd6.0, compiled by GCC cc (GCC) 3.4.4 [FreeBSD] 20050518" > Jan 2 03:05:04 prod-app-1 postgres[8524]: [4-1] 8524 LOG: autovacuum: processing database "template0" > Jan 2 03:05:05 prod-app-1 postgres[8524]: [5-1] 8524 ERROR: could not access status of transaction 1924023481 > Jan 2 03:05:05 prod-app-1 postgres[8524]: [5-2] 8524 DETAIL: could not open file "pg_clog/072A": No such file or > directory > ls -l /db/pg_clog/ > total 984 > -rw------- 1 pgsql wheel 262144 Jan 11 09:55 07CF > -rw------- 1 pgsql wheel 262144 Jan 11 13:45 07D0 > -rw------- 1 pgsql wheel 262144 Jan 11 17:00 07D1 > -rw------- 1 pgsql wheel 172032 Jan 11 20:39 07D2 So apparently there's some row in template0 that didn't get marked committed before the pg_clog segment for it went away. Given 8.1's rather schizophrenic view of whether it can modify template0 or not, this is not too surprising, but I thought we'd put in some defense against this happening. Alvaro? Jeff, had you changed your autovac settings recently? Given that autovac seems to be trying to hit template0 every few minutes, it's hard to see how the tuple got missed for long enough to be a problem. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| Tom Lane <tgl@sss.pgh.pa.us> wrote:So apparently there's some row in template0 that didn't get marked committed before the pg_clog segment for it went away. Given 8.1's rather schizophrenic view of whether it can modify template0 or not, this is not too surprising, but I thought we'd put in some defense against this happening. Alvaro? Jeff, had you changed your autovac settings recently? Given that autovac seems to be trying to hit template0 every few minutes, it's hard to see how the tuple got missed for long enough to be a problem. Sure enough I did make autovacuum more agressive about 30-45 days ago (have to check the logs to find the exact date). Was originally whatever default settings that came out of the box with 8.1. Naptime is currently set to 60 seconds. Am I to assume that this probably isn't the result of some operating system or filesystem misfeasance and that corrupt data in my 'real' databases is probably not an issue? --------------------------------- Now that's room service! Choose from over 150,000 hotels in 45,000 destinations on Yahoo! Travel to find your fit. |
| |||
| Jeff Amiel wrote: > Sure enough I did make autovacuum more agressive about 30-45 days ago (have to check the logs to find the exact date). Was originally whatever default settings that came out of the box with 8.1. Naptime is currently set to 60 seconds. > > Am I to assume that this probably isn't the result of some operating system or filesystem misfeasance and that corrupt data in my 'real' databases is probably not an issue? Did you perchance connect to template0 (probably marking it as connectable in the process), made some modification there, and then mark it as non-connectable again, without executing VACUUM FREEZE on it? AFAICS we only execute VACUUM FREEZE on it, so we shouldn't leave any unfrozen tuples. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc. ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| Alvaro Herrera <alvherre@commandprompt.com> wrote: Did you perchance connect to template0 (probably marking it as connectable in the process), made some modification there, and then mark it as non-connectable again, without executing VACUUM FREEZE on it? AFAICS we only execute VACUUM FREEZE on it, so we shouldn't leave any unfrozen tuples. Nope. Have never touched template0...haven't made any changes to pg_hba.conf (which is how I assume you would make template0 connectable) until last night when we determined that template0 might be at fault and we were going to run pg_filedump on it to see if we could find anything useful. I assume this is similar/identical to this issue: http://archives.postgresql.org/pgsql...3/msg01294.php Any ideas on how should I move forward? --------------------------------- Finding fabulous fares is fun. Let Yahoo! FareChase search your favorite travel sites to find flight and hotel bargains. |
| |||
| Jeff Amiel <becauseimjeff@yahoo.com> writes: > Any ideas on how should I move forward? Well, if the problem is indeed in pg_statistic, it'll be easy to repair (think TRUNCATE...). Have you turned up the logging level to find out? BTW, please don't do anything to try to correct the problem until we're pretty sure we understand how this happened --- we might ask you for more info. AFAICS this isn't having any bad effects except for bleats in your log file, so you can wait. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster |
| |||
| Tom Lane <tgl@sss.pgh.pa.us> wrote: Well, if the problem is indeed in pg_statistic, it'll be easy to repair (think TRUNCATE...). Have you turned up the logging level to find out? BTW, please don't do anything to try to correct the problem until we're pretty sure we understand how this happened --- we might ask you for more info. AFAICS this isn't having any bad effects except for bleats in your log file, so you can wait. We turned up the logging this morning.....(debug2) and are awaiting the issue to re-occur. Will report back then. Thanks. --------------------------------- It's here! Your new message! Get new email alerts with the free Yahoo! Toolbar. |
| ||||
| Tom Lane <tgl@sss.pgh.pa.us> wrote: BTW, please don't do anything to try to correct the problem until we're pretty sure we understand how this happened --- we might ask you for more info. AFAICS this isn't having any bad effects except for bleats in your log file, so you can wait. Happened again...however this time not appearingly related to an autovacuum. I have a past example of this almost identical event a few weeks ago. Here's what the logs look like with debug2 as the log_min level: Jan 13 08:26:53 prod-app-1 postgres[41795]: [40171-1] 41795 LOG: autovacuum: processing database "template1" Jan 13 08:26:53 prod-app-1 postgres[560]: [40171-1] 560 DEBUG: server process (PID 41794) exited with exit code 0 Jan 13 08:26:57 prod-app-1 postgres[563]: [915-1] 563 DEBUG: checkpoint starting Jan 13 08:26:57 prod-app-1 postgres[563]: [916-1] 563 DEBUG: recycled transaction log file "000000010000005D00000069" Jan 13 08:26:57 prod-app-1 postgres[563]: [917-1] 563 DEBUG: checkpoint complete; 0 transaction log file(s) added, 0 removed, 1 recycled Jan 13 08:27:02 prod-app-1 postgres[560]: [40172-1] 560 DEBUG: forked new backend, pid=42368 socket=8 Jan 13 08:27:02 prod-app-1 postgres[560]: [40173-1] 560 DEBUG: forked new backend, pid=42369 socket=8 Jan 13 08:27:02 prod-app-1 postgres[560]: [40174-1] 560 DEBUG: forked new backend, pid=42370 socket=8 Jan 13 08:27:02 prod-app-1 postgres[560]: [40175-1] 560 DEBUG: server process (PID 42370) exited with exit code 0 Jan 13 08:27:02 prod-app-1 postgres[560]: [40176-1] 560 DEBUG: forked new backend, pid=42371 socket=8 Jan 13 08:27:02 prod-app-1 postgres[560]: [40177-1] 560 DEBUG: server process (PID 42369) exited with exit code 0 Jan 13 08:27:02 prod-app-1 postgres[560]: [40178-1] 560 DEBUG: server process (PID 42371) exited with exit code 0 Jan 13 08:27:02 prod-app-1 postgres[560]: [40179-1] 560 DEBUG: forked new backend, pid=42372 socket=8 Jan 13 08:27:02 prod-app-1 postgres[560]: [40180-1] 560 DEBUG: server process (PID 42372) exited with exit code 0 Jan 13 08:27:02 prod-app-1 postgres[560]: [40181-1] 560 DEBUG: forked new backend, pid=42373 socket=8 Jan 13 08:27:02 prod-app-1 postgres[560]: [40182-1] 560 DEBUG: server process (PID 42373) exited with exit code 0 Jan 13 08:27:02 prod-app-1 postgres[560]: [40183-1] 560 DEBUG: forked new backend, pid=42374 socket=8 Jan 13 08:27:02 prod-app-1 postgres[560]: [40184-1] 560 DEBUG: server process (PID 42374) exited with exit code 0 Jan 13 08:27:02 prod-app-1 postgres[560]: [40185-1] 560 DEBUG: server process (PID 42368) exited with exit code 0 Jan 13 08:27:08 prod-app-1 postgres[560]: [40186-1] 560 DEBUG: forked new backend, pid=42375 socket=8 Jan 13 08:27:08 prod-app-1 postgres[560]: [40187-1] 560 DEBUG: server process (PID 42375) exited with exit code 0 Jan 13 08:27:23 prod-app-1 postgres[560]: [40188-1] 560 DEBUG: forked new backend, pid=42376 socket=8 Jan 13 08:27:23 prod-app-1 postgres[560]: [40189-1] 560 DEBUG: server process (PID 42376) exited with exit code 0 Jan 13 08:27:26 prod-app-1 postgres[92257]: [30259-1] jboss 92257 ERROR: could not access status of transaction 2107200825 Jan 13 08:27:26 prod-app-1 postgres[92257]: [30259-2] jboss 92257 DETAIL: could not open file "pg_clog/07D9": No such file or directory Jan 13 08:27:26 prod-app-1 postgres[92257]: [30259-3] jboss 92257 CONTEXT: SQL statement "DELETE FROM audit_metadata WHERE user_id <> -1" Jan 13 08:27:26 prod-app-1 postgres[92257]: [30259-4] PL/pgSQL function "disable_auditing" line 2 at SQL statement Jan 13 08:27:26 prod-app-1 postgres[92257]: [30259-5] jboss 92257 STATEMENT: select disable_auditing() Jan 13 08:27:38 prod-app-1 postgres[560]: [40190-1] 560 DEBUG: forked new backend, pid=42377 socket=8 Jan 13 08:27:38 prod-app-1 postgres[560]: [40191-1] 560 DEBUG: server process (PID 42377) exited with exit code 0 Jan 13 08:27:49 prod-app-1 postgres[560]: [40192-1] 560 DEBUG: forked new backend, pid=42378 socket=8 Jan 13 08:27:50 prod-app-1 postgres[560]: [40193-1] 560 DEBUG: forked new backend, pid=42379 socket=8 Jan 13 08:27:50 prod-app-1 postgres[560]: [40194-1] 560 DEBUG: forked new backend, pid=42380 socket=8 Jan 13 08:27:53 prod-app-1 postgres[560]: [40195-1] 560 DEBUG: forked new backend, pid=42381 socket=8 Jan 13 08:27:53 prod-app-1 postgres[42382]: [40196-1] 42382 LOG: autovacuum: processing database "postgres" Jan 13 08:27:53 prod-app-1 postgres[560]: [40196-1] 560 DEBUG: server process (PID 42381) exited with exit code 0 Jan 13 08:28:02 prod-app-1 postgres[560]: [40197-1] 560 DEBUG: forked new backend, pid=42951 socket=8 Jan 13 08:28:02 prod-app-1 postgres[560]: [40198-1] 560 DEBUG: forked new backend, pid=42952 socket=8 pg_clog dir looks like this: -rw------- 1 pgsql wheel 262144 Jan 13 05:41 07DA -rw------- 1 pgsql wheel 262144 Jan 13 08:06 07DB -rw------- 1 pgsql wheel 90112 Jan 13 08:51 07DC Now that table, audit_metadata, is a temporary table (when accessed by jboss as it is here). There is a 'rea'l table with the same name, but only used by batch processes that connect to postgres. Thoughts? --------------------------------- Food fight? Enjoy some healthy debate in the Yahoo! Answers Food & Drink Q&A. |