This is a discussion on Re: Cost of XLogInsert CRC calculations within the pgsql Hackers forums, part of the PostgreSQL category; --> Tom Lane <tgl@sss.pgh.pa.us> writes: > It's not really a matter of backstopping the hardware's error detection; > if we ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| Tom Lane <tgl@sss.pgh.pa.us> writes: > It's not really a matter of backstopping the hardware's error detection; > if we were trying to do that, we'd keep a CRC for every data page, which > we don't. The real reason for the WAL CRCs is as a reliable method of > identifying the end of WAL: when the "next record" doesn't checksum you > know it's bogus. This is a nontrivial point because of the way that we > re-use WAL files --- the pages beyond the last successfully written page > aren't going to be zeroes, they'll be filled with random WAL data. Is the random WAL data really the concern? It seems like a more reliable way of dealing with that would be to just accompany every WAL entry with a sequential id and stop when the next id isn't the correct one. I thought the problem was that if the machine crashed in the middle of writing a WAL entry you wanted to be sure to detect that. And there's no guarantee the fsync will write out the WAL entry in order. So it's possible the end (and beginning) of the WAL entry will be there but the middle still have been unwritten. The only truly reliable way to handle this would require two fsyncs per transaction commit which would be really unfortunate. > Personally I think CRC32 is plenty for this job, but there were those > arguing loudly for CRC64 back when we made the decision originally ... So given the frequency of database crashes and WAL replays if having one failed replay every few million years is acceptable I think 32 bits is more than enough. Frankly I think 16 bits would be enough. -- greg ---------------------------(end of broadcast)--------------------------- TIP 8: explain analyze is your friend |
| |||
| Greg Stark <gsstark@mit.edu> writes: > Tom Lane <tgl@sss.pgh.pa.us> writes: >> It's not really a matter of backstopping the hardware's error detection; >> if we were trying to do that, we'd keep a CRC for every data page, which >> we don't. The real reason for the WAL CRCs is as a reliable method of >> identifying the end of WAL: when the "next record" doesn't checksum you >> know it's bogus. > Is the random WAL data really the concern? It seems like a more reliable way > of dealing with that would be to just accompany every WAL entry with a > sequential id and stop when the next id isn't the correct one. We do that, too (the xl_prev links and page header addresses serve that purpose). But it's not sufficient given that WAL records can span pages and therefore may be incompletely written. > The only truly reliable way to handle this would require two fsyncs per > transaction commit which would be really unfortunate. How are two fsyncs going to be better than one? regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 7: don't forget to increase your free space map settings |
| |||
| Tom Lane <tgl@sss.pgh.pa.us> writes: > > Is the random WAL data really the concern? It seems like a more reliable way > > of dealing with that would be to just accompany every WAL entry with a > > sequential id and stop when the next id isn't the correct one. > > We do that, too (the xl_prev links and page header addresses serve that > purpose). But it's not sufficient given that WAL records can span pages > and therefore may be incompletely written. Right, so the problem isn't that there may be stale data that would be unrecognizable from real data. The problem is that the real data may be partially there but not all there. > > The only truly reliable way to handle this would require two fsyncs per > > transaction commit which would be really unfortunate. > > How are two fsyncs going to be better than one? Well you fsync the WAL entry and only when that's complete do you flip a bit marking the WAL entry as commited and fsync again. Hm, you might need three fsyncs, one to make sure the bit isn't set before writing out the WAL record itself. -- greg ---------------------------(end of broadcast)--------------------------- TIP 6: Have you searched our list archives? http://archives.postgresql.org |
| |||
| Greg Stark <gsstark@mit.edu> writes: > Tom Lane <tgl@sss.pgh.pa.us> writes: >>> Is the random WAL data really the concern? It seems like a more reliable way >>> of dealing with that would be to just accompany every WAL entry with a >>> sequential id and stop when the next id isn't the correct one. >> >> We do that, too (the xl_prev links and page header addresses serve that >> purpose). But it's not sufficient given that WAL records can span pages >> and therefore may be incompletely written. Actually, on reviewing the code I notice two missed bets here. 1. During WAL replay, we aren't actually verifying that xl_prev matches the address of the prior WAL record. This means we are depending only on the page header addresses to make sure we don't replay stale WAL data left over from the previous cycle of use of the physical WAL file. This is fairly dangerous, considering the likelihood of partial write of a WAL page during a power failure: the first 512-byte sector(s) of a page may have been updated but not the rest. If an old WAL record happens to start at exactly the sector boundary then we lose. 2. We store a separate CRC for each backup block attached to a WAL record. Therefore the same torn-page problem could hit us if a stale backup block starts exactly at a intrapage sector boundary --- there is nothing guaranteeing that the backup block really goes with the WAL record. #1 seems like a pretty critical, but also easily fixed, bug. To fix #2 I suppose we'd have to modify the WAL format to store just one CRC covering the whole of a WAL record and its attached backup block(s). I think the reasoning behind the separate CRCs was to put a limit on the amount of data guarded by one CRC, and thereby minimize the risk of undetected errors. But using the CRCs in this way is failing to guard against exactly the problem that we want the CRCs to guard against in the first place, namely torn WAL records ... so worrying about detection reliability seems misplaced. The odds of an actual failure from case #2 are fortunately not high, since a backup block will necessarily span across at least one WAL page boundary and so we should be able to detect stale data by noting that the next page's header address is wrong. (If it's not wrong, then at least the first sector of the next page is up-to-date, so if there is any tearing the CRC should tell us.) Therefore I don't feel any need to try to work out a back-patchable solution for #2. But I do think we ought to change the WAL format going forward to compute just one CRC across a WAL record and all attached backup blocks. There was talk of allowing compression of backup blocks, and if we do that we could no longer feel any certainty that a page crossing would occur. Thoughts? regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org |
| |||
| On Tue, 31 May 2005 12:07:53 +0100, "Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> wrote: >Perhaps Manfred can tell us the generator >polynomial that was used to create the lookup tables? 32 26 23 22 16 12 11 10 8 7 5 4 2 1 X + X + X + X + X + X + X + X + X + X + X + X + X + X + 1 -> http://www.opengroup.org/onlinepubs/...ies/cksum.html Or google for "04c11db7". Servus Manfred ---------------------------(end of broadcast)--------------------------- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) |
| |||
| On Tue, 2005-05-31 at 12:27 -0400, Tom Lane wrote: > Greg Stark <gsstark@mit.edu> writes: > > Tom Lane <tgl@sss.pgh.pa.us> writes: > >>> Is the random WAL data really the concern? It seems like a more reliable way > >>> of dealing with that would be to just accompany every WAL entry with a > >>> sequential id and stop when the next id isn't the correct one. > >> > >> We do that, too (the xl_prev links and page header addresses serve that > >> purpose). But it's not sufficient given that WAL records can span pages > >> and therefore may be incompletely written. > > Actually, on reviewing the code I notice two missed bets here. > > 1. During WAL replay, we aren't actually verifying that xl_prev matches > the address of the prior WAL record. This means we are depending only > on the page header addresses to make sure we don't replay stale WAL data > left over from the previous cycle of use of the physical WAL file. This > is fairly dangerous, considering the likelihood of partial write of a > WAL page during a power failure: the first 512-byte sector(s) of a page > may have been updated but not the rest. If an old WAL record happens to > start at exactly the sector boundary then we lose. Hmmm. I seem to recall asking myself why xl_prev existed if it wasn't used, but passed that by. Damn. > 2. We store a separate CRC for each backup block attached to a WAL > record. Therefore the same torn-page problem could hit us if a stale > backup block starts exactly at a intrapage sector boundary --- there is > nothing guaranteeing that the backup block really goes with the WAL > record. > > #1 seems like a pretty critical, but also easily fixed, bug. To fix #2 > I suppose we'd have to modify the WAL format to store just one CRC > covering the whole of a WAL record and its attached backup block(s). > > I think the reasoning behind the separate CRCs was to put a limit on > the amount of data guarded by one CRC, and thereby minimize the risk > of undetected errors. But using the CRCs in this way is failing to > guard against exactly the problem that we want the CRCs to guard against > in the first place, namely torn WAL records ... so worrying about > detection reliability seems misplaced. > > The odds of an actual failure from case #2 are fortunately not high, > since a backup block will necessarily span across at least one WAL page > boundary and so we should be able to detect stale data by noting that > the next page's header address is wrong. (If it's not wrong, then at > least the first sector of the next page is up-to-date, so if there is > any tearing the CRC should tell us.) Therefore I don't feel any need > to try to work out a back-patchable solution for #2. But I do think we > ought to change the WAL format going forward to compute just one CRC > across a WAL record and all attached backup blocks. There was talk of > allowing compression of backup blocks, and if we do that we could no > longer feel any certainty that a page crossing would occur. > > Thoughts? PreAllocXLog was already a reason to have somebody prepare new xlog files ahead of them being used. Surely the right solution here is to have that agent prepare fresh/zeroed files prior to them being required. That way no stale data can ever occur and both of these bugs go away.... Fixing that can be backpatched so that the backend that switches files can do the work, rather than bgwriter [ or ?]. Best Regards, Simon Riggs ---------------------------(end of broadcast)--------------------------- TIP 6: Have you searched our list archives? http://archives.postgresql.org |
| |||
| Simon Riggs <simon@2ndquadrant.com> writes: > Hmmm. I seem to recall asking myself why xl_prev existed if it wasn't > used, but passed that by. Damn. I couldn't believe it'd been overlooked this long, either. It's the sort of thing that you assume got done the first time :-( > PreAllocXLog was already a reason to have somebody prepare new xlog > files ahead of them being used. Surely the right solution here is to > have that agent prepare fresh/zeroed files prior to them being required. Uh, why? That doubles the amount of physical I/O required to maintain the WAL, and AFAICS it doesn't really add any safety that we can't get in a more intelligent fashion. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 6: Have you searched our list archives? http://archives.postgresql.org |
| |||
| On Tue, 2005-05-31 at 22:36 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > Hmmm. I seem to recall asking myself why xl_prev existed if it wasn't > > used, but passed that by. Damn. > > I couldn't believe it'd been overlooked this long, either. It's the > sort of thing that you assume got done the first time :-( Guess it shows how infrequently PostgreSQL crashes and recovers. > > PreAllocXLog was already a reason to have somebody prepare new xlog > > files ahead of them being used. Surely the right solution here is to > > have that agent prepare fresh/zeroed files prior to them being required. > > Uh, why? That doubles the amount of physical I/O required to maintain > the WAL, and AFAICS it doesn't really add any safety that we can't get > in a more intelligent fashion. OK, I agree that the xl_prev linkage is the more intelligent way to go. If I/O is a problem, then surely you will agree that PreAllocXLog is still required and should not be run by a backend? Thats going to show as a big response time spike for that user. Thats the last bastion - the other changes are gonna smooth response times right down, so can we do something with PreAllocXLog too? Best Regards, Simon Riggs ---------------------------(end of broadcast)--------------------------- TIP 8: explain analyze is your friend |
| |||
| > -----Original Message----- > From: Tom Lane [mailto:tgl@sss.pgh.pa.us] > Sent: 31 May 2005 17:27 > To: Greg Stark > Cc: Mark Cave-Ayland (External); 'Manfred Koizar'; 'Bruce > Momjian'; pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] Cost of XLogInsert CRC calculations (cut) > The odds of an actual failure from case #2 are fortunately > not high, since a backup block will necessarily span across > at least one WAL page boundary and so we should be able to > detect stale data by noting that the next page's header > address is wrong. (If it's not wrong, then at least the > first sector of the next page is up-to-date, so if there is > any tearing the CRC should tell us.) Therefore I don't feel > any need to try to work out a back-patchable solution for #2. > But I do think we ought to change the WAL format going > forward to compute just one CRC across a WAL record and all > attached backup blocks. There was talk of allowing > compression of backup blocks, and if we do that we could no > longer feel any certainty that a page crossing would occur. I must admit I didn't realise that an XLog record consisted of a number of backup blocks which were also separately CRCd. I've been through the source, and while the XLog code is reasonably well commented, I couldn't find a README in the transam/ directory that explained the thinking behind the current implementation - is this something that was discussed on the mailing lists way back in the mists of time? I'm still a little nervous about dropping down to CRC32 from CRC64 and so was just wondering what the net saving would be using one CRC64 across the whole WAL record? For example, if an insert or an update uses 3 backup blocks then this one change alone would immediately reduce the CPU usage to one third of its original value? (something tells me that this is probably not the case as I imagine you would have picked this up a while back). In my view, having a longer CRC is like buying a holiday with insurance - you pay the extra cost knowing that should anything happen then you have something to fall back on. However, the hard part here is determining a reasonable balance betweem the cost and the risk. Kind regards, Mark. ------------------------ WebBased Ltd South West Technology Centre Tamar Science Park Plymouth PL6 8BT T: +44 (0)1752 797131 F: +44 (0)1752 791023 W: http://www.webbased.co.uk ---------------------------(end of broadcast)--------------------------- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| ||||
| Simon Riggs <simon@2ndquadrant.com> writes: > If I/O is a problem, then surely you will agree that PreAllocXLog is > still required and should not be run by a backend? It is still required, but it isn't run by backends --- it's fired off during checkpoints. I think there was some discussion recently about making it more aggressive about allocating future segments; which strikes me as a good idea. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 8: explain analyze is your friend |
| Thread Tools | |
| Display Modes | |
|
|