Unix Technical Forum

RAID arrays and performance

This is a discussion on RAID arrays and performance within the Pgsql Performance forums, part of the PostgreSQL category; --> Matthew wrote: > On Tue, 4 Dec 2007, Mark Mielke wrote: > >> The bitmap scan method does ordered ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql Performance

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #11 (permalink)  
Old 04-19-2008, 11:42 AM
Mark Mielke
 
Posts: n/a
Default Re: RAID arrays and performance

Matthew wrote:
> On Tue, 4 Dec 2007, Mark Mielke wrote:
>
>> The bitmap scan method does ordered reads of the table, which can
>> partially take advantage of sequential reads. Not sure whether bitmap
>> scan is optimal for your situation or whether your situation would allow
>> this to be taken advantage of.
>>

> Bitmap scan may help where several randomly-accessed pages are next to
> each other. However, I would expect the operating system's readahead and
> buffer cache to take care of that one.
>

The disk head has less theoretical distance to travel if always moving
in a single direction instead of randomly seeking back and forth.
>> Do you know that there is a problem, or are you speculating about one? I
>> think your case would be far more compelling if you could show a
>> problem. :-)
>>
>> I would think that at a minimum, having 12 disks with RAID 0 or RAID 1+0
>> would allow your insane queries to run concurrent with up to 12 other
>> queries.
>>

> Yeah, I don't really care about concurrency. It's pretty obvious that
> running x concurrent queries on an x-disc RAID system will allow the
> utilisation of all the discs at once, therefore allowing the performance
> to scale by x. What I'm talking about is a single query running on an
> x-disc RAID array.
>

The time to seek to a particular sector does not reduce 12X with 12
disks. It is still approximately the same, only it can handle 12X the
concurrency. This makes RAID best for concurrent loads. In your
scenario, you are trying to make a single query take advantage of this
concurrency by taking advantage of the system cache. I don't think full
12X concurrency of a single query is possible for most loads, probably
including yours. See later for details.
>> I recall talk of more intelligent table scanning algorithms, and the use
>> of asynchronous I/O to benefit from RAID arrays, but the numbers
>> prepared to convince people that the change would have effect have been
>> less than impressive.
>>

> I think a twelve-times speed increase is impressive. Granted, given
> greatly-concurrent access, the benefits go away, but I think it'd be worth
> it for when there are few queries running on the system.
>
> I don't think you would have to create a more intelligent table scanning
> algorithm. What you would need to do is take the results of the index,
> convert that to a list of page fetches, then pass that list to the OS as
> an asynchronous "please fetch all these into the buffer cache" request,
> then do the normal algorithm as is currently done. The requests would then
> come out of the cache instead of from the disc. Of course, this is from a
> simple Java programmer who doesn't know the OS interfaces for this sort of
> thing.
>

That's about how the talk went. :-)

The problem is that a 12X speed for 12 disks seems unlikely except under
very specific loads (such as a sequential scan of a single table). Each
of the indexes may need to be scanned or searched in turn, then each of
the tables would need to be scanned or searched in turn, depending on
the query plan. There is no guarantee that the index rows or the table
rows are equally spread across the 12 disks. CPU processing becomes
involved with is currently limited to a single processor thread. I
suspect no database would achieve a 12X speedup for 12 disks unless a
simple sequential scan of a single table was required, in which case the
reads could be fully parallelized with RAID 0 using standard sequential
reads, and this is available today using built-in OS or disk read-ahead.

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #12 (permalink)  
Old 04-19-2008, 11:42 AM
James Mansion
 
Posts: n/a
Default Re: RAID arrays and performance

matthew@flymine.org wrote:
> So, if you hand requests one by one to the disc, it will almost always be
> faster to order them. On the other hand, if you hand a huge long list of
> requests to a decent SCSI or SATA-NCQ disc in one go, it will reorder the
> reads itself, and it will do it much better than you.
>
>

Sure - but this doesn't suggest threading so much as pushing all reads
into AIO as soon as they can
be identified - and hoping that your os has a decent AIO subsystem,
which is sadly a tall order
for many *nix systems.

I do think some thought should be given to threading but I would expect
the effect to be more
noticeable for updates where you update tables that have multiple
indices. In the case of your
scan then you need threading on CPU (rather than concurrent IO through
AIO) if the disks
can feed you data faster than you can scan it. Which might be the case
for some scans
using user functions, but I wouldn't have thought it would be the case
for a sinple index scan.

At some point, hopefully, the engines will be:
a) thread safe (if not thread hot) so it can exist with threaded user
functions and embedded
languages
b) able to incorporate C++ add-in functionality

There may not be a pressing reason to support either of these, but
having a capability to
experiment would surely be helpful and allow incremental enhancement -
so baby steps
could be made to (for example) move stats and logging to a background
thread, move
push of results to clients out of the query evaluator thread, and so
on. Parallel threading
queries is a whle different ball game which needs thought in the optimiser.

James


---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #13 (permalink)  
Old 04-19-2008, 11:42 AM
James Mansion
 
Posts: n/a
Default Re: RAID arrays and performance

Mark Mielke wrote:
> This assumes that you can know which pages to fetch ahead of time -
> which you do not except for sequential read of a single table.
>

Why doesn't it help to issue IO ahead-of-time requests when you are
scanning an index? You can read-ahead
in index pages, and submit requests for data pages as soon as it is
clear you'll want them. Doing so can allow
the disks and OS to relax the order in which you receive them, which may
allow you to process them while IO
continues, and it may also optimise away some seeking and settle time.
Maybe.


---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #14 (permalink)  
Old 04-19-2008, 11:42 AM
Matthew
 
Posts: n/a
Default Re: RAID arrays and performance

On Tue, 4 Dec 2007, Mark Mielke wrote:
> > The larger the set of requests, the closer the performance will scale to
> > the number of discs

>
> This assumes that you can know which pages to fetch ahead of time -
> which you do not except for sequential read of a single table.


There are circumstances where it may be hard to locate all the pages ahead
of time - that's probably when you're doing a nested loop join. However,
if you're looking up in an index and get a thousand row hits in the index,
then there you go. Page locations to load.

> Please show one of your query plans and how you as a person would design
> which pages to request reads for.


How about the query that "cluster <skrald@amossen.dk>" was trying to get
to run faster a few days ago? Tom Lane wrote about it:

| Wouldn't help, because the accesses to "questions" are not the problem.
| The query's spending nearly all its time in the scan of "posts", and
| I'm wondering why --- doesn't seem like it should take 6400msec to fetch
| 646 rows, unless perhaps the data is just horribly misordered relative
| to the index.

Which is exactly what's going on. The disc is having to seek 646 times
fetching a single row each time, and that takes 6400ms. He obviously has a
standard 5,400 or 7,200 rpm drive with a seek time around 10ms.

Or on a similar vein, fill a table with completely random values, say ten
million rows with a column containing integer values ranging from zero to
ten thousand. Create an index on that column, analyse it. Then pick a
number between zero and ten thousand, and

"SELECT * FROM table WHERE that_column = the_number_you_picked"

Matthew

--
Experience is what allows you to recognise a mistake the second time you
make it.

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #15 (permalink)  
Old 04-19-2008, 11:42 AM
Mark Mielke
 
Posts: n/a
Default Re: RAID arrays and performance

James Mansion wrote:
> Mark Mielke wrote:
>> This assumes that you can know which pages to fetch ahead of time -
>> which you do not except for sequential read of a single table.

> Why doesn't it help to issue IO ahead-of-time requests when you are
> scanning an index? You can read-ahead
> in index pages, and submit requests for data pages as soon as it is
> clear you'll want them. Doing so can allow
> the disks and OS to relax the order in which you receive them, which
> may allow you to process them while IO
> continues, and it may also optimise away some seeking and settle
> time. Maybe.

Sorry to be unclear. To achieve a massive speedup (12X for 12 disks with
RAID 0) requires that you know what reads to perform in advance. The
moment you do not, you only have a starting point, your operations begin
to serialize again. For example, you must scan the first index, to be
able to know what table rows to read. At a minimum, this breaks your
query into: 1) Preload all the index pages you will need, 2) Scan the
index pages you needed, 3) Preload all the table page you will need, 4)
Scan the table pages you needed. But do you really need the whole index?
What if you only need parts of the index, and this plan now reads the
whole index using async I/O "just in case" it is useful? Index is a
B-Tree for a reason. In Matthew's case where he has an IN clause with
thousands of possibles (I think?), perhaps a complete index scan is
always the best case - but that's only one use case, and in my opinion,
an obscure one. As soon as additional table joins become involved, the
chance that whole index scans are required would probably normally
reduce, which turns the index scan into a regular B-Tree scan, which is
difficult to perform async I/O for, as you don't necessarily know which
pages to read next.

It seems like a valuable goal - but throwing imaginary numbers around
does not appeal to me. I am more interested in Gregory's simulations. I
would like to understand his simulation better, and see his results.
Speculation about amazing potential is barely worth the words used to
express it. The real work is in design and implementation. :-)

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #16 (permalink)  
Old 04-19-2008, 11:42 AM
Mark Mielke
 
Posts: n/a
Default Re: RAID arrays and performance

Matthew wrote:
> On Tue, 4 Dec 2007, Mark Mielke wrote:
>
>>> The larger the set of requests, the closer the performance will scale to
>>> the number of discs
>>>

>> This assumes that you can know which pages to fetch ahead of time -
>> which you do not except for sequential read of a single table.
>>

> There are circumstances where it may be hard to locate all the pages ahead
> of time - that's probably when you're doing a nested loop join. However,
> if you're looking up in an index and get a thousand row hits in the index,
> then there you go. Page locations to load.
>

Sure.
>> Please show one of your query plans and how you as a person would design
>> which pages to request reads for.
>>

> How about the query that "cluster <skrald@amossen.dk>" was trying to get
> to run faster a few days ago? Tom Lane wrote about it:
>
> | Wouldn't help, because the accesses to "questions" are not the problem.
> | The query's spending nearly all its time in the scan of "posts", and
> | I'm wondering why --- doesn't seem like it should take 6400msec to fetch
> | 646 rows, unless perhaps the data is just horribly misordered relative
> | to the index.
>
> Which is exactly what's going on. The disc is having to seek 646 times
> fetching a single row each time, and that takes 6400ms. He obviously has a
> standard 5,400 or 7,200 rpm drive with a seek time around 10ms.
>

Your proposal would not necessarily improve his case unless he also
purchased additional disks, at which point his execution time may be
different. More speculation. :-)

It seems reasonable - but still a guess.

> Or on a similar vein, fill a table with completely random values, say ten
> million rows with a column containing integer values ranging from zero to
> ten thousand. Create an index on that column, analyse it. Then pick a
> number between zero and ten thousand, and
>
> "SELECT * FROM table WHERE that_column = the_number_you_picked

This isn't a real use case. Optimizing for the worst case scenario is
not always valuable.

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>


Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #17 (permalink)  
Old 04-19-2008, 11:42 AM
Matthew
 
Posts: n/a
Default Re: RAID arrays and performance

On Tue, 4 Dec 2007, Gregory Stark wrote:
> Fwiw, what made you bring up this topic now? You're the second person in about
> two days to bring up precisely this issue and it was an issue I had been
> planning to bring up on -hackers as it was.


I only just joined the performance mailing list to talk about R-trees. I
would probably have brought it up earlier if I had been here earlier.
However, we're thinking of buying this large machine, and that reminded
me.

I have been biting at the bit for my bosses to allow me time to write an
indexing system for transient data - a lookup table backed by disc,
looking up from an integer to get an object, native in Java. Our system
often needs to fetch a list of a thousand different objects by a key like
that, and Postgres just doesn't do that exact thing fast. I was going to
implement it with full asynchronous IO, to do that particular job very
fast, so I have done a reasonable amount of research into the topic. In
Java, that is. It would add a little bit more performance for our system.
That wouldn't cover us - we still need to do complex queries with the same
problem, and that'll have to stay in Postgres.

Matthew

--
The early bird gets the worm. If you want something else for breakfast, get
up later.

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #18 (permalink)  
Old 04-19-2008, 11:42 AM
Mark Mielke
 
Posts: n/a
Default Re: RAID arrays and performance

Matthew wrote:
> On Tue, 4 Dec 2007, Gregory Stark wrote:
>
>> Fwiw, what made you bring up this topic now? You're the second person in about
>> two days to bring up precisely this issue and it was an issue I had been
>> planning to bring up on -hackers as it was.
>>

> I only just joined the performance mailing list to talk about R-trees. I
> would probably have brought it up earlier if I had been here earlier.
> However, we're thinking of buying this large machine, and that reminded
> me.
>
> I have been biting at the bit for my bosses to allow me time to write an
> indexing system for transient data - a lookup table backed by disc,
> looking up from an integer to get an object, native in Java. Our system
> often needs to fetch a list of a thousand different objects by a key like
> that, and Postgres just doesn't do that exact thing fast. I was going to
> implement it with full asynchronous IO, to do that particular job very
> fast, so I have done a reasonable amount of research into the topic. In
> Java, that is. It would add a little bit more performance for our system.
> That wouldn't cover us - we still need to do complex queries with the same
> problem, and that'll have to stay in Postgres

So much excitement and zeal - refreshing to see. And yet, no numbers! :-)

You describe a new asynchronous I/O system to map integers to Java
objects above. Why would you write this? Have you tried BerkeleyDB or
BerkeleyDB JE? BerkeleyDB with BDB as a backend or their full Java
backend gives you a Java persistence API that will allow you to map Java
objects (including integers) to other Java objects. They use generated
Java run time instructions instead of reflection to store and lock your
Java objects. If it came to a bet, I would bet that their research and
tuning over several years, and many people, would beat your initial
implementation, asynchronous I/O or not.

Asynchronous I/O is no more a magic bullet than threading. It requires a
lot of work to get it right, and if one gets it wrong, it can be slower
than the regular I/O or single threaded scenarios. Both look sexy on
paper, neither may be the solution to your problem. Or they may be. We
wouldn't know without numbers.

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>


Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #19 (permalink)  
Old 04-19-2008, 11:42 AM
Matthew
 
Posts: n/a
Default Re: RAID arrays and performance

On Tue, 4 Dec 2007, Mark Mielke wrote:
> So much excitement and zeal - refreshing to see. And yet, no numbers! :-)


What sort of numbers did you want to see?

> You describe a new asynchronous I/O system to map integers to Java
> objects above. Why would you write this? Have you tried BerkeleyDB or
> BerkeleyDB JE? BerkeleyDB with BDB as a backend or their full Java
> backend gives you a Java persistence API that will allow you to map Java
> objects (including integers) to other Java objects.


Looked at all those. Didn't like their performance characteristics, or
their interfaces. It's simply the fact that they're not designed for the
workload that we want to put on them.

> If it came to a bet, I would bet that their research and
> tuning over several years, and many people, would beat your initial
> implementation, asynchronous I/O or not.


Quite possibly. However, there's the possibility that it wouldn't. And I
can improve it - initial implementations often suck.

> Asynchronous I/O is no more a magic bullet than threading. It requires a
> lot of work to get it right, and if one gets it wrong, it can be slower
> than the regular I/O or single threaded scenarios. Both look sexy on
> paper, neither may be the solution to your problem. Or they may be. We
> wouldn't know without numbers.


Okay, numbers. About eight years ago I wrote the shell of a filesystem
implementation, concentrating on performance and integrity. It absolutely
whooped ext2 for both read and write speed - especially metadata write
speed. Anything up to 60 times as fast. I wrote a load of metadata to
ext2, which took 599.52 seconds, and on my system took 10.54 seconds.
Listing it back (presumably from cache) took 1.92 seconds on ext2 and 0.22
seconds on my system. No silly caching tricks that sacrifice integrity.

It's a pity I got nowhere near finishing that system - just enough to
prove my point and get a degree, but looking back on it there are massive
ways it's rubbish and should be improved. It was an initial
implementation. I didn't have reiserfs, jfs, or xfs available at that
time, but it would have been really interesting to compare. This is the
system I would have based my indexing thing on.

Matthew

--
Anyone who goes to a psychiatrist ought to have his head examined.

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #20 (permalink)  
Old 04-19-2008, 11:42 AM
Gregory Stark
 
Posts: n/a
Default Re: RAID arrays and performance

"Matthew" <matthew@flymine.org> writes:

> On Tue, 4 Dec 2007, Mark Mielke wrote:
>> So much excitement and zeal - refreshing to see. And yet, no numbers! :-)

>
> What sort of numbers did you want to see?


FWIW I posted some numbers from a synthetic case to pgsql-hackers

http://archives.postgresql.org/pgsql...2/msg00088.php

Very promising


--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 06:38 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Search Engine Optimization by vBSEO 3.2.0
www.UnixAdminTalk.com