Unix Technical Forum

POSIX file updates

#11
04-19-2008, 11:46 AM
Greg Smith
Re: POSIX file updates

On Wed, 2 Apr 2008, James Mansion wrote:

>> But amusingly, PostgreSQL doesn't even support Solaris's direct I/O
>> method right now unless you override the filesystem mounting options,
>> so you end up needing to split it out and hack at that level
>> regardless.

> Indeed that's a shame. Why doesn't it use the directio?


You turn on direct I/O differently under Solaris than everywhere else, and
nobody has bothered to write the patch (trivial) and OS-specific code to
turn it on only when appropriate (slightly trickier) to handle this case.
There's not a lot of pressure on PostgreSQL to handle this case correctly
when Solaris admins are used to doing direct I/O tricks on filesystems
already, so they don't complain about it much.
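
As a rough illustration of why the OS-specific part is the trickier bit,
here is a minimal Python sketch of choosing a direct I/O mechanism per
platform. The function name direct_io_method and the strings it returns are
hypothetical. The only facts assumed are that Linux takes an O_DIRECT flag
at open time while Solaris uses a separate directio() call on a descriptor
that is already open. This is not PostgreSQL's code.

```python
import sys


def direct_io_method(platform: str = sys.platform) -> str:
    """Describe how direct I/O would be requested on a platform (illustrative only)."""
    if platform.startswith("sunos"):
        # Solaris: open the file normally, then call directio(fd, DIRECTIO_ON).
        return "directio(fd, DIRECTIO_ON) after open()"
    if platform.startswith("linux"):
        # Linux: request it up front by passing O_DIRECT in the open() flags.
        return "O_DIRECT flag passed to open()"
    # Anything else: this sketch simply falls back to ordinary buffered I/O.
    return "buffered I/O (no direct I/O in this sketch)"


print(direct_io_method())
```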

> Yes but fsync and stable on disk isn't the same thing if there is a
> cache anywhere is it? Hence the fuss a while back about Apple's control
> of disk caches. Solaris and Windows do it too.


If your caches don't honor fsync by making sure it's on disk or a
battery-backed cache, you can't use them and expect PostgreSQL to operate
reliably. Back to that "doesn't honor the contract" case. The code that
implements fsync_writethrough on both Windows and Mac OS handles those two
cases by writing with the appropriate flags to not get cached in a harmful
way. I'm not aware of Solaris doing anything stupid here--the last two
Solaris x64 systems I've tried that didn't have a real controller write
cache ignored the drive cache and blocked at fsync just as expected,
limiting commits to the RPM of the drive. Seen it on UFS and ZFS, both
seem to do the right thing here.
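
The "limited to the RPM of the drive" point can be put into numbers. The
sketch below assumes a single client whose every commit has to wait for one
full platter rotation and that no write cache is involved; the function name
is made up for illustration. Under those assumptions a 7200 RPM drive tops
out at 120 commits per second.

```python
def max_commits_per_second(rpm: float) -> float:
    """Upper bound on serialized commits/sec if each commit waits one rotation."""
    # Rotations per minute divided by 60 gives rotations (and so commits) per second.
    return rpm / 60.0


for rpm in (7200, 10000, 15000):
    print(rpm, "RPM ->", round(max_commits_per_second(rpm)), "commits/sec at most")
```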

> Isn't allowing the OS to accumulate an arbitrary number of dirty blocks
> without control of the rate at which they spill to media just exposing a
> possibility of an IO storm when it comes to checkpoint time? Does
> bgwriter attempt to control this with intermediate fsync (and push to
> media if available)?


It can cause exactly such a storm. If you haven't seen my other paper
at http://www.westnet.com/~gsmith/conte...ux-pdflush.htm yet, it covers
how Linux handles this exact issue. Now that it's easy to get even a
home machine with 8GB of RAM in it, Linux will gladly buffer ~800MB worth
of data for you and cause a serious storm at fsync time. It's not pretty
when that lands on a single SATA drive, because there are typically
plenty of seeks in that write storm too.
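
To get a feel for the size of that storm, here is a back-of-the-envelope
sketch. The 10% dirty fraction is an assumption chosen to match the
8GB / ~800MB figures above, and the 40 MB/s drive throughput is purely
illustrative; seeks would make the real number worse.

```python
def dirty_backlog_mb(ram_gb: float, dirty_fraction: float = 0.10) -> float:
    """Approximate dirty data (MB) the OS may hold before forcing writes out."""
    return ram_gb * 1000 * dirty_fraction


def flush_seconds(backlog_mb: float, write_mb_per_s: float) -> float:
    """Time to drain the backlog at a given sustained write rate."""
    return backlog_mb / write_mb_per_s


backlog = dirty_backlog_mb(8)        # about 800 MB for an 8GB machine
stall = flush_seconds(backlog, 40)   # assumed 40 MB/s effective write rate
print(f"{backlog:.0f} MB backlog, roughly {stall:.0f} s to flush")
```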

There was a prototype plan to spread fsyncs out a bit better to keep this
from being as bad, but it wasn't followed through completely in 8.3. That
optimization might make it into 8.4 but I don't know that anybody is
working on it. The spread checkpoints in 8.3 are so much better than 8.2
that many are happy to at least have that.
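
The general idea of intermediate fsyncs can be sketched outside the
database. The Python below writes a buffer in chunks and calls fsync every
few chunks, so the amount of dirty data waiting on the OS stays bounded.
It is only an illustration of the technique; the function name, chunk size
and sync interval are assumptions, and this is not how PostgreSQL's
checkpoint code is written.

```python
import os


def write_with_bounded_dirty_data(path, data, chunk_size=1 << 20, sync_every=8):
    """Write data in chunks, fsyncing after every `sync_every` chunks.

    Returns the number of fsync calls made, including the final one.
    """
    syncs = 0
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | getattr(os, "O_BINARY", 0)
    fd = os.open(path, flags, 0o600)
    try:
        view = memoryview(data)
        chunks_since_sync = 0
        for offset in range(0, len(view), chunk_size):
            chunk = view[offset:offset + chunk_size]
            # os.write may write fewer bytes than asked for, so loop until done.
            while len(chunk):
                chunk = chunk[os.write(fd, chunk):]
            chunks_since_sync += 1
            if chunks_since_sync == sync_every:
                os.fsync(fd)  # push this batch out before buffering more
                syncs += 1
                chunks_since_sync = 0
        os.fsync(fd)  # final sync covers whatever is left over
        syncs += 1
    finally:
        os.close(fd)
    return syncs
```

The trade-off is the usual one: more frequent syncs mean smaller stalls but
less opportunity for the OS to reorder and merge writes.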

> It strikes me as odd that fsync_writethrough isn't the most preferred
> option where it is implemented.


It's only available on Win32 and Mac OS X (the OSes that might get it
wrong without that nudge). I believe every path through the code uses it
by default on those platforms; there's a lot of remapping in there.

You can get an idea of what code was touched by looking at the patch that
added the OS X version of fsync_writethrough (it was previously only
Win32): http://archives.postgresql.org/pgsql...5/msg00208.php
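
For readers who want to see what the "nudge" looks like at the system call
level, here is a small Python sketch. On Mac OS X the F_FULLFSYNC fcntl
asks the drive to flush its own cache, which a plain fsync there does not
do. The helper name sync_to_media is made up, and the fallback behaviour is
an assumption of this sketch rather than a description of PostgreSQL's
implementation.

```python
import os

try:
    import fcntl  # not available on Windows
except ImportError:
    fcntl = None


def sync_to_media(fd: int) -> str:
    """Flush fd as far down as the platform allows; return the method used."""
    full_fsync = getattr(fcntl, "F_FULLFSYNC", None) if fcntl else None
    if full_fsync is not None:
        try:
            # Mac OS X: ask the drive itself to flush its write cache.
            fcntl.fcntl(fd, full_fsync)
            return "F_FULLFSYNC"
        except OSError:
            pass  # some filesystems reject it; fall back to plain fsync
    os.fsync(fd)
    return "fsync"
```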

> The postgres approach of *requiring* that there be no cache below the OS
> is problematic, especially since the battery backup on internal array
> controllers is hardly the handiest solution when you find the mobo has
> died.


If the battery backup cache doesn't survive being moved to another machine
after a motherboard failure, it's not very good. The real risk to be
concerned about is what happens if the card itself dies. If that happens,
you can't help but lose transactions.

You seem to feel that there is an alternative here that PostgreSQL could
take but doesn't. There is not. You either wait until writes hit disk,
which by physical limitations only happens at RPM speed and therefore is
too slow to commit for many cases, or you cache in the most reliable
memory you've got and hope for the best. No software approach can change
any of that.

> And especially when the inability to flush caches on modern SATA and SAS
> drives would appear to be more a failing in some operating systems than
> in the drives themselves..


I think you're extrapolating too much from the Win32/Apple cases here.
There are plenty of cases where the so-called "lying" drives themselves
are completely stupid on their own regardless of operating system.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

#12
04-19-2008, 11:46 AM
Greg Smith
Re: POSIX file updates

On Wed, 2 Apr 2008, James Mansion wrote:

> I'm well aware that there are battery-backed caches that can be detached
> from controllers and moved. But you'd better make darn sure you move
> all the drives and plug them in in exactly the right order and make sure
> they all spin up OK with the replaced cache, because it's expecting them
> to be exactly as they were last time they were on the bus.


The better controllers tag the drives with a unique ID number so they can
route pending writes correctly even after such a disaster. This falls
into the category of tests people should do more often but don't: write
something into the cache, pull the power, rearrange the drives, and see if
everything still recovers.

> You would think hard drives could have enough capacitor store to dump
> cache to flash or the drive - if only to a special dump zone near where
> the heads park. They are spinning already after all.


The free market seems to have established that the preferred design model
for hard drives is that they be cheap and fast rather than focused on
reliability. I rather doubt the tiny percentage of the world who cares as
much about disk write integrity as database professionals do can possibly
make a big enough market to bother increasing the cost and design
complexity of the drive to do this.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

#13
04-19-2008, 11:46 AM
James Mansion
Re: POSIX file updates

Greg Smith wrote:
> You turn on direct I/O differently under Solaris than everywhere else,
> and nobody has bothered to write the patch (trivial) and OS-specific
> code to turn it on only when appropriate (slightly trickier) to handle
> this case. There's not a lot of pressure on PostgreSQL to handle this
> case correctly when Solaris admins are used to doing direct I/O tricks
> on filesystems already, so they don't complain about it much.

I'm not sure that this will survive use of PostgreSQL on Solaris with
more users on Indiana though. Which I'm hoping will happen.

> RPM of the drive. Seen it on UFS and ZFS, both seem to do the right
> thing here.

But ZFS *is* smart enough to manage the cache, albeit sometimes with
unexpected consequences as with the 2530 here http://milek.blogspot.com/.

> You seem to feel that there is an alternative here that PostgreSQL
> could take but doesn't. There is not. You either wait until writes
> hit disk, which by physical limitations only happens at RPM speed and
> therefore is too slow to commit for many cases, or you cache in the
> most reliable memory you've got and hope for the best. No software
> approach can change any of that.

Indeed I do, but the issue I have is that some popular operating systems
(let's try to avoid the flame war) fail to expose control of disk caches,
and so the code assumes that the onus is on the admin, and the
documentation rightly says so. But this is as much a failure of the POSIX
API and operating systems to expose something that's necessary, and it
seems to me rather valuable that the application be able to work with such
facilities as they become available. Exposing the flush cache mechanisms
isn't dangerous and can improve performance for non-DBMS users of the same
drives.

I think manipulation of this stuff is a major concern for a DBMS that
might be used by amateur SAs, and if at all possible it should work out of
the box on common hardware. So far as I can tell, SQLServerExpress makes a
pretty good attempt at it, for example. It might be enough for initdb to
whinge and fail if it thinks the disks are behaving insanely unless the
would-be DBA sets a 'my_disks_really_are_that_fast' flag in the config. At
the moment anyone can apt-get themselves a DBMS which may become a
liability.
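
A check along those lines could be sketched roughly as follows. This is a
hypothetical illustration in Python, not anything initdb does: it times a
handful of small write-plus-fsync cycles and flags the result if it is
faster than a platter could plausibly rotate. The 15000 RPM ceiling is an
assumption, and my_disks_really_are_that_fast is only the flag name
suggested above. A battery-backed cache would legitimately trip the check,
which is what the override is for.

```python
import os
import tempfile
import time


def measure_fsync_rate(directory: str, iterations: int = 50) -> float:
    """Return observed write+fsync cycles per second on a scratch file."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(iterations):
            os.write(fd, b"x" * 512)
            os.fsync(fd)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return iterations / elapsed if elapsed > 0 else float("inf")


def disks_look_suspicious(fsyncs_per_second: float,
                          assumed_rpm: float = 15000,
                          my_disks_really_are_that_fast: bool = False) -> bool:
    """True if the fsync rate exceeds one per rotation and no override is set."""
    if my_disks_really_are_that_fast:
        return False
    return fsyncs_per_second > assumed_rpm / 60.0


rate = measure_fsync_rate(tempfile.gettempdir(), iterations=20)
print(f"{rate:.0f} fsyncs/sec, suspicious: {disks_look_suspicious(rate)}")
```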

At the moment:
- casual use is likely to be unreliable
- uncontrolled deferred IO can result in almost DOS-like checkpoints

These affect other systems than PostgreSQL too - but would be avoidable if
the drive cache flush was better exposed and the IO was staged to use it.
There's no reason to block on anything but the final IO in a WAL commit
after all, and with the deferred commit feature (which I really like for
workflow engines) intermediate WAL writes of configured chunk size could
let the WAL drives get on with it. Admittedly I'm assuming a non-blocking
write through - direct IO from a background thread (process if you must)
or aio.
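
The staging idea can be shown with a toy log writer. In the Python sketch
below, a single background thread writes and fsyncs each chunk as it is
handed over, and the caller blocks only at commit time. The class name
StagedLog and its methods are invented for this example; it uses ordinary
buffered writes plus fsync rather than true direct IO or aio, and it is not
how PostgreSQL's WAL works.

```python
import os
from concurrent.futures import ThreadPoolExecutor


class StagedLog:
    """Append-only log whose chunks are written and synced in the background."""

    def __init__(self, path: str):
        flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND | getattr(os, "O_BINARY", 0)
        self._fd = os.open(path, flags, 0o600)
        # One worker thread, so chunks reach the file in submission order.
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending = []

    def _write_and_sync(self, chunk: bytes) -> None:
        view = memoryview(chunk)
        while len(view):
            view = view[os.write(self._fd, view):]
        os.fsync(self._fd)

    def append(self, chunk: bytes) -> None:
        """Hand a chunk to the background writer without waiting for it."""
        self._pending.append(self._pool.submit(self._write_and_sync, chunk))

    def commit(self) -> None:
        """Block until every chunk handed over so far is written and synced."""
        for future in self._pending:
            future.result()  # also re-raises any error from the writer thread
        self._pending.clear()

    def close(self) -> None:
        self._pool.shutdown(wait=True)
        os.close(self._fd)
```

By the time commit() is called most of the earlier chunks have usually
been synced already, so the wait is mostly for the last one.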

> There are plenty of cases where the so-called "lying" drives
> themselves are completely stupid on their own regardless of operating
> system.

With modern NCQ capable drive firmware? Or just with older PATA stuff?
There's an awful lot of FUD out there about SCSI vs IDE still.

James

