Unix Technical Forum

RAID arrays and performance

This is a discussion on RAID arrays and performance within the Pgsql Performance forums, part of the PostgreSQL category; --> I have a question about how Postgres makes use of RAID arrays for performance, because we are considering buying ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql Performance

Register FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-19-2008, 10:42 AM
Matthew
 
Posts: n/a
Default RAID arrays and performance


I have a question about how Postgres makes use of RAID arrays for
performance, because we are considering buying a 12-disc array for
performance reasons. I'm interested in how the performance scales with the
number of discs in the array.

Now, I know that for an OLTP workload (in other words, lots of small
parallel random accesses), the performance will scale almost with the
number of discs. However, I'm more interested in the performance of
individual queries, particularly those where Postgres has to do an index
scan, which will result in a single query performing lots of random
accesses to the disc system. Theoretically, this *can* scale with the
number of discs too - my question is does it?

Does Postgres issue requests to each random access in turn, waiting for
each one to complete before issuing the next request (in which case the
performance will not exceed that of a single disc), or does it use some
clever asynchronous access method to send a queue of random access
requests to the OS that can be distributed among the available discs?

Any knowledgable answers or benchmark proof would be appreciated,

Matthew

--
"To err is human; to really louse things up requires root
privileges." -- Alexander Pope, slightly paraphrased

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-19-2008, 10:42 AM
Gregory Stark
 
Posts: n/a
Default Re: RAID arrays and performance

"Matthew" <matthew@flymine.org> writes:

> Does Postgres issue requests to each random access in turn, waiting for
> each one to complete before issuing the next request (in which case the
> performance will not exceed that of a single disc), or does it use some
> clever asynchronous access method to send a queue of random access
> requests to the OS that can be distributed among the available discs?


Sorry, it does the former, at least currently.

That said, this doesn't really come up nearly as often as you might think.
Normally queries fit mostly in either the large batch query domain or the
small quick oltp query domain. For the former Postgres tries quite hard to do
sequential i/o which the OS will do readahead for and you'll get good
performance. For the latter you're normally running many simultaneous such
queries and the raid array helps quite a bit.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-19-2008, 10:42 AM
Matthew
 
Posts: n/a
Default Re: RAID arrays and performance

On Tue, 4 Dec 2007, Gregory Stark wrote:
> "Matthew" <matthew@flymine.org> writes:
>
> > Does Postgres issue requests to each random access in turn, waiting for
> > each one to complete before issuing the next request (in which case the
> > performance will not exceed that of a single disc), or does it use some
> > clever asynchronous access method to send a queue of random access
> > requests to the OS that can be distributed among the available discs?

>
> Sorry, it does the former, at least currently.
>
> That said, this doesn't really come up nearly as often as you might think.


Shame. It comes up a *lot* in my project. A while ago we converted a task
that processes a queue of objects to processing groups of a thousand
objects, which sped up the process considerably. So we run an awful lot of
queries with IN lists with a thousand values. They hit the indexes, then
fetch the rows by random access. A full table sequential scan would take
much longer. It'd be awfully nice to have those queries go twelve times
faster.

> Normally queries fit mostly in either the large batch query domain or the
> small quick oltp query domain. For the former Postgres tries quite hard to do
> sequential i/o which the OS will do readahead for and you'll get good
> performance. For the latter you're normally running many simultaneous such
> queries and the raid array helps quite a bit.


Having twelve discs will certainly improve the sequential IO throughput!

However, if this was implemented (and I have *no* idea whatsoever how hard
it would be), then large index scans would scale with the number of discs
in the system, which would be quite a win, I would imagine. Large index
scans can't be that rare!

Matthew

--
Software suppliers are trying to make their software packages more
'user-friendly'.... Their best approach, so far, has been to take all
the old brochures, and stamp the words, 'user-friendly' on the cover.
-- Bill Gates

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 04-19-2008, 10:42 AM
Mark Mielke
 
Posts: n/a
Default Re: RAID arrays and performance

Matthew wrote:
> On Tue, 4 Dec 2007, Gregory Stark wrote:
>
>> "Matthew" <matthew@flymine.org> writes
>>> Does Postgres issue requests to each random access in turn, waiting for
>>> each one to complete before issuing the next request (in which case the
>>> performance will not exceed that of a single disc), or does it use some
>>> clever asynchronous access method to send a queue of random access
>>> requests to the OS that can be distributed among the available discs?
>>>

>> Sorry, it does the former, at least currently.
>> That said, this doesn't really come up nearly as often as you might think.
>>

> Shame. It comes up a *lot* in my project. A while ago we converted a task
> that processes a queue of objects to processing groups of a thousand
> objects, which sped up the process considerably. So we run an awful lot of
> queries with IN lists with a thousand values. They hit the indexes, then
> fetch the rows by random access. A full table sequential scan would take
> much longer. It'd be awfully nice to have those queries go twelve times
> faster.
>

The bitmap scan method does ordered reads of the table, which can
partially take advantage of sequential reads. Not sure whether bitmap
scan is optimal for your situation or whether your situation would allow
this to be taken advantage of.

>> Normally queries fit mostly in either the large batch query domain or the
>> small quick oltp query domain. For the former Postgres tries quite hard to do
>> sequential i/o which the OS will do readahead for and you'll get good
>> performance. For the latter you're normally running many simultaneous such
>> queries and the raid array helps quite a bit.
>>

> Having twelve discs will certainly improve the sequential IO throughput!
>
> However, if this was implemented (and I have *no* idea whatsoever how hard
> it would be), then large index scans would scale with the number of discs
> in the system, which would be quite a win, I would imagine. Large index
> scans can't be that rare!
>

Do you know that there is a problem, or are you speculating about one? I
think your case would be far more compelling if you could show a
problem. :-)

I would think that at a minimum, having 12 disks with RAID 0 or RAID 1+0
would allow your insane queries to run concurrent with up to 12 other
queries. Unless your insane query is the only query in use on the
system, I think you may be speculating about a nearly non-existence
problem. Just a suggestion...

I recall talk of more intelligent table scanning algorithms, and the use
of asynchronous I/O to benefit from RAID arrays, but the numbers
prepared to convince people that the change would have effect have been
less than impressive.

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>

---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 04-19-2008, 10:42 AM
Matthew
 
Posts: n/a
Default Re: RAID arrays and performance

On Tue, 4 Dec 2007, Mark Mielke wrote:
> The bitmap scan method does ordered reads of the table, which can
> partially take advantage of sequential reads. Not sure whether bitmap
> scan is optimal for your situation or whether your situation would allow
> this to be taken advantage of.


Bitmap scan may help where several randomly-accessed pages are next to
each other. However, I would expect the operating system's readahead and
buffer cache to take care of that one.

> Do you know that there is a problem, or are you speculating about one? I
> think your case would be far more compelling if you could show a
> problem. :-)
>
> I would think that at a minimum, having 12 disks with RAID 0 or RAID 1+0
> would allow your insane queries to run concurrent with up to 12 other
> queries.


Yeah, I don't really care about concurrency. It's pretty obvious that
running x concurrent queries on an x-disc RAID system will allow the
utilisation of all the discs at once, therefore allowing the performance
to scale by x. What I'm talking about is a single query running on an
x-disc RAID array.

> Unless your insane query is the only query in use on the system...


That's exactly the case.

> I recall talk of more intelligent table scanning algorithms, and the use
> of asynchronous I/O to benefit from RAID arrays, but the numbers
> prepared to convince people that the change would have effect have been
> less than impressive.


I think a twelve-times speed increase is impressive. Granted, given
greatly-concurrent access, the benefits go away, but I think it'd be worth
it for when there are few queries running on the system.

I don't think you would have to create a more intelligent table scanning
algorithm. What you would need to do is take the results of the index,
convert that to a list of page fetches, then pass that list to the OS as
an asynchronous "please fetch all these into the buffer cache" request,
then do the normal algorithm as is currently done. The requests would then
come out of the cache instead of from the disc. Of course, this is from a
simple Java programmer who doesn't know the OS interfaces for this sort of
thing.

Matthew

--
Here we go - the Fairy Godmother redundancy proof.
-- Computer Science Lecturer

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #6 (permalink)  
Old 04-19-2008, 10:42 AM
Gregory Stark
 
Posts: n/a
Default Re: RAID arrays and performance


"Mark Mielke" <mark@mark.mielke.cc> writes:

> Matthew wrote:
>
>> I don't think you would have to create a more intelligent table scanning
>> algorithm. What you would need to do is take the results of the index,
>> convert that to a list of page fetches, then pass that list to the OS as
>> an asynchronous "please fetch all these into the buffer cache" request,
>> then do the normal algorithm as is currently done. The requests would then
>> come out of the cache instead of from the disc. Of course, this is from a
>> simple Java programmer who doesn't know the OS interfaces for this sort of
>> thing.

>
> That's about how the talk went. :-)
>
> The problem is that a 12X speed for 12 disks seems unlikely except under very
> specific loads (such as a sequential scan of a single table). Each of the
> indexes may need to be scanned or searched in turn, then each of the tables
> would need to be scanned or searched in turn, depending on the query plan.
> There is no guarantee that the index rows or the table rows are equally spread
> across the 12 disks. CPU processing becomes involved with is currently limited
> to a single processor thread. I suspect no database would achieve a 12X speedup
> for 12 disks unless a simple sequential scan of a single table was required, in
> which case the reads could be fully parallelized with RAID 0 using standard
> sequential reads, and this is available today using built-in OS or disk
> read-ahead.


I'm sure you would get something between 1x and 12x though...

I'm rerunning my synthetic readahead tests now. That doesn't show the effect
of the other cpu and i/o work being done in the meantime but surely if they're
being evicted from cache too soon that just means your machine is starved for
cache and you should add more RAM?

Also, it's true, you need to preread more than 12 blocks to handle a 12-disk
raid. My offhand combinatorics analysis seems to indicate you would expect to
need to n(n-1)/2 blocks on average before you've hit all the blocks. There's
little penalty to prereading unless you use up too much kernel resources or
you do unnecessary i/o which you never use, so I would expect doing n^2 capped
at some reasonable number like 1,000 pages (enough to handle a 32-disk raid)
would be reasonable.

The real trick is avoiding doing prefetches that are never needed. The user
may never actually read all the tuples being requested. I think that means we
shouldn't prefetch until the second tuple is read and then gradually increase
the prefetch distance as you read more and more of the results.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #7 (permalink)  
Old 04-19-2008, 10:42 AM
Matthew
 
Posts: n/a
Default Re: RAID arrays and performance

On Tue, 4 Dec 2007, Mark Mielke wrote:
> The disk head has less theoretical distance to travel if always moving
> in a single direction instead of randomly seeking back and forth.


True... and false. The head can move pretty quickly, and it also has
rotational latency and settling time to deal with. This means that there
are cases where it is actually faster to move across the disc and read a
block, then move back and read a block than to read them in order.

So, if you hand requests one by one to the disc, it will almost always be
faster to order them. On the other hand, if you hand a huge long list of
requests to a decent SCSI or SATA-NCQ disc in one go, it will reorder the
reads itself, and it will do it much better than you.

> The time to seek to a particular sector does not reduce 12X with 12
> disks. It is still approximately the same, only it can handle 12X the
> concurrency. This makes RAID best for concurrent loads. In your
> scenario, you are trying to make a single query take advantage of this
> concurrency by taking advantage of the system cache.


Kind of. The system cache is just a method to make it simpler to explain -
I don't know the operating system interfaces, but it's likely that the
actual call is something more like "transfer these blocks to these memory
locations and tell me when they're all finished." I'm trying to make a
single query concurrent by using the knowledge of a *list* of accesses to
be made, and getting the operating system to do all of them concurrently.

> The problem is that a 12X speed for 12 disks seems unlikely except under
> very specific loads (such as a sequential scan of a single table).


I'll grant you that 12X may not actually be reached, but it'll be
somewhere between 1X and 12X. I'm optimistic.

> Each of the indexes may need to be scanned or searched in turn, then
> each of the tables would need to be scanned or searched in turn,
> depending on the query plan.


Yes, the indexes would also have to be accessed concurrently, and that
will undoubtedly be harder to code than accessing the tables concurrently.

> There is no guarantee that the index rows or the table rows are equally
> spread across the 12 disks.


Indeed. However, you will get an advantage if they are spread out at all.
Statistically, the larger the number of pages that need to be retrieved in
the set, the more equally spread between the discs they will be. It's the
times when there are a large number of pages to retrieve that this will be
most useful.

> CPU processing becomes involved with is currently limited to a single
> processor thread.


On the contrary, this is a problem at the moment for sequential table
scans, but would not be a problem for random accesses. If you have twelve
discs all throwing 80MB/s at the CPU, it's understandable that the CPU
won't keep up. However, when you're making random accesses, with say a
15,000 rpm disc, and retrieving a single 8k page on every access, each
disc will be producing a maximum of 2MB per second, which can be handled
quite easily by modern CPUs. Index scans are limited by the disc, not the
CPU.

Mathew

--
A. Top Posters
> Q. What's the most annoying thing in the world?


---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #8 (permalink)  
Old 04-19-2008, 10:42 AM
Matthew
 
Posts: n/a
Default Re: RAID arrays and performance

On Tue, 4 Dec 2007, Gregory Stark wrote:
> Also, it's true, you need to preread more than 12 blocks to handle a 12-disk
> raid. My offhand combinatorics analysis seems to indicate you would expect to
> need to n(n-1)/2 blocks on average before you've hit all the blocks. There's
> little penalty to prereading unless you use up too much kernel resources or
> you do unnecessary i/o which you never use, so I would expect doing n^2 capped
> at some reasonable number like 1,000 pages (enough to handle a 32-disk raid)
> would be reasonable.


It's better than that, actually. Let's assume a RAID 0 striped set of
twelve discs. If you spread out twelve requests to twelve discs, then the
expected number of requests to each disc is one. The probablility that any
single disc receives more than say three requests is rather small. As you
increase the number of requests, the longest reasonably-expected queue
length for a particular disc gets closer to the number of requests divided
by the number of discs, as the requests get spread more and more evenly
among the discs.

The larger the set of requests, the closer the performance will scale to
the number of discs.

Matthew

--
All of this sounds mildly turgid and messy and confusing... but what the
heck. That's what programming's all about, really
-- Computer Science Lecturer

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #9 (permalink)  
Old 04-19-2008, 10:42 AM
Gregory Stark
 
Posts: n/a
Default Re: RAID arrays and performance


Fwiw, what made you bring up this topic now? You're the second person in about
two days to bring up precisely this issue and it was an issue I had been
planning to bring up on -hackers as it was.

"Matthew" <matthew@flymine.org> writes:

> Kind of. The system cache is just a method to make it simpler to explain -
> I don't know the operating system interfaces, but it's likely that the
> actual call is something more like "transfer these blocks to these memory
> locations and tell me when they're all finished." I'm trying to make a
> single query concurrent by using the knowledge of a *list* of accesses to
> be made, and getting the operating system to do all of them concurrently.


There are two interfaces of which I'm aware of. posix_fadvise() which just
tells the kernel you'll be needing the blocks soon. Linux at least initiates
I/O on them into cache. libaio which lets you specify the location to read the
blocks and how to notify you when they're ready.

On Linux posix_fadvise works great but libaio doesn't seem to gain any speed
bonus, at least on 2.6.22 with glibc 2.6.1. I was under the impression there
was a kernel implementation somewhere but apparently it's not helping.

On Solaris apparently it doesn't have posix_fadvise but libaio works great. We
could use libaio as a kind of backdoor fadvise where we just initiate i/o on
the block but throw away the results assuming they'll stay in cache for the
real read or we could add an asynchronous interface to the buffer manager. The
latter is attractive but would be a much more invasive patch. I'm inclined to
start with the former.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #10 (permalink)  
Old 04-19-2008, 10:42 AM
Mark Mielke
 
Posts: n/a
Default Re: RAID arrays and performance

Matthew wrote:
> On Tue, 4 Dec 2007, Gregory Stark wrote:
>
>> Also, it's true, you need to preread more than 12 blocks to handle a 12-disk
>> raid. My offhand combinatorics analysis seems to indicate you would expect to
>> need to n(n-1)/2 blocks on average before you've hit all the blocks. There's
>> little penalty to prereading unless you use up too much kernel resources or
>> you do unnecessary i/o which you never use, so I would expect doing n^2 capped
>> at some reasonable number like 1,000 pages (enough to handle a 32-disk raid)
>> would be reasonable.
>>

> It's better than that, actually. Let's assume a RAID 0 striped set of
> twelve discs. If you spread out twelve requests to twelve discs, then the
> expected number of requests to each disc is one. The probablility that any
> single disc receives more than say three requests is rather small. As you
> increase the number of requests, the longest reasonably-expected queue
> length for a particular disc gets closer to the number of requests divided
> by the number of discs, as the requests get spread more and more evenly
> among the discs.
>
> The larger the set of requests, the closer the performance will scale to
> the number of discs


This assumes that you can know which pages to fetch ahead of time -
which you do not except for sequential read of a single table.

I think it would be possible that your particular case could be up to 6X
faster, but that most other people will see little or no speed up. If it
does read the wrong pages, it is wasting it's time.

I am not trying to discourage you - only trying to ensure that you have
reasonable expectations. 12X is far too optimistic.

Please show one of your query plans and how you as a person would design
which pages to request reads for.

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>


Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 04:22 PM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Search Engine Optimization by vBSEO 3.2.0
www.UnixAdminTalk.com