This is a discussion on RAID arrays and performance within the Pgsql Performance forums, part of the PostgreSQL category; --> I have a question about how Postgres makes use of RAID arrays for performance, because we are considering buying ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| I have a question about how Postgres makes use of RAID arrays for performance, because we are considering buying a 12-disc array for performance reasons. I'm interested in how the performance scales with the number of discs in the array. Now, I know that for an OLTP workload (in other words, lots of small parallel random accesses), the performance will scale almost with the number of discs. However, I'm more interested in the performance of individual queries, particularly those where Postgres has to do an index scan, which will result in a single query performing lots of random accesses to the disc system. Theoretically, this *can* scale with the number of discs too - my question is does it? Does Postgres issue requests to each random access in turn, waiting for each one to complete before issuing the next request (in which case the performance will not exceed that of a single disc), or does it use some clever asynchronous access method to send a queue of random access requests to the OS that can be distributed among the available discs? Any knowledgable answers or benchmark proof would be appreciated, Matthew -- "To err is human; to really louse things up requires root privileges." -- Alexander Pope, slightly paraphrased ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings |
| |||
| "Matthew" <matthew@flymine.org> writes: > Does Postgres issue requests to each random access in turn, waiting for > each one to complete before issuing the next request (in which case the > performance will not exceed that of a single disc), or does it use some > clever asynchronous access method to send a queue of random access > requests to the OS that can be distributed among the available discs? Sorry, it does the former, at least currently. That said, this doesn't really come up nearly as often as you might think. Normally queries fit mostly in either the large batch query domain or the small quick oltp query domain. For the former Postgres tries quite hard to do sequential i/o which the OS will do readahead for and you'll get good performance. For the latter you're normally running many simultaneous such queries and the raid array helps quite a bit. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's On-Demand Production Tuning ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| On Tue, 4 Dec 2007, Gregory Stark wrote: > "Matthew" <matthew@flymine.org> writes: > > > Does Postgres issue requests to each random access in turn, waiting for > > each one to complete before issuing the next request (in which case the > > performance will not exceed that of a single disc), or does it use some > > clever asynchronous access method to send a queue of random access > > requests to the OS that can be distributed among the available discs? > > Sorry, it does the former, at least currently. > > That said, this doesn't really come up nearly as often as you might think. Shame. It comes up a *lot* in my project. A while ago we converted a task that processes a queue of objects to processing groups of a thousand objects, which sped up the process considerably. So we run an awful lot of queries with IN lists with a thousand values. They hit the indexes, then fetch the rows by random access. A full table sequential scan would take much longer. It'd be awfully nice to have those queries go twelve times faster. > Normally queries fit mostly in either the large batch query domain or the > small quick oltp query domain. For the former Postgres tries quite hard to do > sequential i/o which the OS will do readahead for and you'll get good > performance. For the latter you're normally running many simultaneous such > queries and the raid array helps quite a bit. Having twelve discs will certainly improve the sequential IO throughput! However, if this was implemented (and I have *no* idea whatsoever how hard it would be), then large index scans would scale with the number of discs in the system, which would be quite a win, I would imagine. Large index scans can't be that rare! Matthew -- Software suppliers are trying to make their software packages more 'user-friendly'.... Their best approach, so far, has been to take all the old brochures, and stamp the words, 'user-friendly' on the cover. -- Bill Gates ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| Matthew wrote: > On Tue, 4 Dec 2007, Gregory Stark wrote: > >> "Matthew" <matthew@flymine.org> writes >>> Does Postgres issue requests to each random access in turn, waiting for >>> each one to complete before issuing the next request (in which case the >>> performance will not exceed that of a single disc), or does it use some >>> clever asynchronous access method to send a queue of random access >>> requests to the OS that can be distributed among the available discs? >>> >> Sorry, it does the former, at least currently. >> That said, this doesn't really come up nearly as often as you might think. >> > Shame. It comes up a *lot* in my project. A while ago we converted a task > that processes a queue of objects to processing groups of a thousand > objects, which sped up the process considerably. So we run an awful lot of > queries with IN lists with a thousand values. They hit the indexes, then > fetch the rows by random access. A full table sequential scan would take > much longer. It'd be awfully nice to have those queries go twelve times > faster. > The bitmap scan method does ordered reads of the table, which can partially take advantage of sequential reads. Not sure whether bitmap scan is optimal for your situation or whether your situation would allow this to be taken advantage of. >> Normally queries fit mostly in either the large batch query domain or the >> small quick oltp query domain. For the former Postgres tries quite hard to do >> sequential i/o which the OS will do readahead for and you'll get good >> performance. For the latter you're normally running many simultaneous such >> queries and the raid array helps quite a bit. >> > Having twelve discs will certainly improve the sequential IO throughput! > > However, if this was implemented (and I have *no* idea whatsoever how hard > it would be), then large index scans would scale with the number of discs > in the system, which would be quite a win, I would imagine. Large index > scans can't be that rare! > Do you know that there is a problem, or are you speculating about one? I think your case would be far more compelling if you could show a problem. :-) I would think that at a minimum, having 12 disks with RAID 0 or RAID 1+0 would allow your insane queries to run concurrent with up to 12 other queries. Unless your insane query is the only query in use on the system, I think you may be speculating about a nearly non-existence problem. Just a suggestion... I recall talk of more intelligent table scanning algorithms, and the use of asynchronous I/O to benefit from RAID arrays, but the numbers prepared to convince people that the change would have effect have been less than impressive. Cheers, mark -- Mark Mielke <mark@mielke.cc> ---------------------------(end of broadcast)--------------------------- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate |
| |||
| On Tue, 4 Dec 2007, Mark Mielke wrote: > The bitmap scan method does ordered reads of the table, which can > partially take advantage of sequential reads. Not sure whether bitmap > scan is optimal for your situation or whether your situation would allow > this to be taken advantage of. Bitmap scan may help where several randomly-accessed pages are next to each other. However, I would expect the operating system's readahead and buffer cache to take care of that one. > Do you know that there is a problem, or are you speculating about one? I > think your case would be far more compelling if you could show a > problem. :-) > > I would think that at a minimum, having 12 disks with RAID 0 or RAID 1+0 > would allow your insane queries to run concurrent with up to 12 other > queries. Yeah, I don't really care about concurrency. It's pretty obvious that running x concurrent queries on an x-disc RAID system will allow the utilisation of all the discs at once, therefore allowing the performance to scale by x. What I'm talking about is a single query running on an x-disc RAID array. > Unless your insane query is the only query in use on the system... That's exactly the case. > I recall talk of more intelligent table scanning algorithms, and the use > of asynchronous I/O to benefit from RAID arrays, but the numbers > prepared to convince people that the change would have effect have been > less than impressive. I think a twelve-times speed increase is impressive. Granted, given greatly-concurrent access, the benefits go away, but I think it'd be worth it for when there are few queries running on the system. I don't think you would have to create a more intelligent table scanning algorithm. What you would need to do is take the results of the index, convert that to a list of page fetches, then pass that list to the OS as an asynchronous "please fetch all these into the buffer cache" request, then do the normal algorithm as is currently done. The requests would then come out of the cache instead of from the disc. Of course, this is from a simple Java programmer who doesn't know the OS interfaces for this sort of thing. Matthew -- Here we go - the Fairy Godmother redundancy proof. -- Computer Science Lecturer ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |
| |||
| "Mark Mielke" <mark@mark.mielke.cc> writes: > Matthew wrote: > >> I don't think you would have to create a more intelligent table scanning >> algorithm. What you would need to do is take the results of the index, >> convert that to a list of page fetches, then pass that list to the OS as >> an asynchronous "please fetch all these into the buffer cache" request, >> then do the normal algorithm as is currently done. The requests would then >> come out of the cache instead of from the disc. Of course, this is from a >> simple Java programmer who doesn't know the OS interfaces for this sort of >> thing. > > That's about how the talk went. :-) > > The problem is that a 12X speed for 12 disks seems unlikely except under very > specific loads (such as a sequential scan of a single table). Each of the > indexes may need to be scanned or searched in turn, then each of the tables > would need to be scanned or searched in turn, depending on the query plan. > There is no guarantee that the index rows or the table rows are equally spread > across the 12 disks. CPU processing becomes involved with is currently limited > to a single processor thread. I suspect no database would achieve a 12X speedup > for 12 disks unless a simple sequential scan of a single table was required, in > which case the reads could be fully parallelized with RAID 0 using standard > sequential reads, and this is available today using built-in OS or disk > read-ahead. I'm sure you would get something between 1x and 12x though... I'm rerunning my synthetic readahead tests now. That doesn't show the effect of the other cpu and i/o work being done in the meantime but surely if they're being evicted from cache too soon that just means your machine is starved for cache and you should add more RAM? Also, it's true, you need to preread more than 12 blocks to handle a 12-disk raid. My offhand combinatorics analysis seems to indicate you would expect to need to n(n-1)/2 blocks on average before you've hit all the blocks. There's little penalty to prereading unless you use up too much kernel resources or you do unnecessary i/o which you never use, so I would expect doing n^2 capped at some reasonable number like 1,000 pages (enough to handle a 32-disk raid) would be reasonable. The real trick is avoiding doing prefetches that are never needed. The user may never actually read all the tuples being requested. I think that means we shouldn't prefetch until the second tuple is read and then gradually increase the prefetch distance as you read more and more of the results. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication support! ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| On Tue, 4 Dec 2007, Mark Mielke wrote: > The disk head has less theoretical distance to travel if always moving > in a single direction instead of randomly seeking back and forth. True... and false. The head can move pretty quickly, and it also has rotational latency and settling time to deal with. This means that there are cases where it is actually faster to move across the disc and read a block, then move back and read a block than to read them in order. So, if you hand requests one by one to the disc, it will almost always be faster to order them. On the other hand, if you hand a huge long list of requests to a decent SCSI or SATA-NCQ disc in one go, it will reorder the reads itself, and it will do it much better than you. > The time to seek to a particular sector does not reduce 12X with 12 > disks. It is still approximately the same, only it can handle 12X the > concurrency. This makes RAID best for concurrent loads. In your > scenario, you are trying to make a single query take advantage of this > concurrency by taking advantage of the system cache. Kind of. The system cache is just a method to make it simpler to explain - I don't know the operating system interfaces, but it's likely that the actual call is something more like "transfer these blocks to these memory locations and tell me when they're all finished." I'm trying to make a single query concurrent by using the knowledge of a *list* of accesses to be made, and getting the operating system to do all of them concurrently. > The problem is that a 12X speed for 12 disks seems unlikely except under > very specific loads (such as a sequential scan of a single table). I'll grant you that 12X may not actually be reached, but it'll be somewhere between 1X and 12X. I'm optimistic. > Each of the indexes may need to be scanned or searched in turn, then > each of the tables would need to be scanned or searched in turn, > depending on the query plan. Yes, the indexes would also have to be accessed concurrently, and that will undoubtedly be harder to code than accessing the tables concurrently. > There is no guarantee that the index rows or the table rows are equally > spread across the 12 disks. Indeed. However, you will get an advantage if they are spread out at all. Statistically, the larger the number of pages that need to be retrieved in the set, the more equally spread between the discs they will be. It's the times when there are a large number of pages to retrieve that this will be most useful. > CPU processing becomes involved with is currently limited to a single > processor thread. On the contrary, this is a problem at the moment for sequential table scans, but would not be a problem for random accesses. If you have twelve discs all throwing 80MB/s at the CPU, it's understandable that the CPU won't keep up. However, when you're making random accesses, with say a 15,000 rpm disc, and retrieving a single 8k page on every access, each disc will be producing a maximum of 2MB per second, which can be handled quite easily by modern CPUs. Index scans are limited by the disc, not the CPU. Mathew -- A. Top Posters > Q. What's the most annoying thing in the world? ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| On Tue, 4 Dec 2007, Gregory Stark wrote: > Also, it's true, you need to preread more than 12 blocks to handle a 12-disk > raid. My offhand combinatorics analysis seems to indicate you would expect to > need to n(n-1)/2 blocks on average before you've hit all the blocks. There's > little penalty to prereading unless you use up too much kernel resources or > you do unnecessary i/o which you never use, so I would expect doing n^2 capped > at some reasonable number like 1,000 pages (enough to handle a 32-disk raid) > would be reasonable. It's better than that, actually. Let's assume a RAID 0 striped set of twelve discs. If you spread out twelve requests to twelve discs, then the expected number of requests to each disc is one. The probablility that any single disc receives more than say three requests is rather small. As you increase the number of requests, the longest reasonably-expected queue length for a particular disc gets closer to the number of requests divided by the number of discs, as the requests get spread more and more evenly among the discs. The larger the set of requests, the closer the performance will scale to the number of discs. Matthew -- All of this sounds mildly turgid and messy and confusing... but what the heck. That's what programming's all about, really -- Computer Science Lecturer ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |
| |||
| Fwiw, what made you bring up this topic now? You're the second person in about two days to bring up precisely this issue and it was an issue I had been planning to bring up on -hackers as it was. "Matthew" <matthew@flymine.org> writes: > Kind of. The system cache is just a method to make it simpler to explain - > I don't know the operating system interfaces, but it's likely that the > actual call is something more like "transfer these blocks to these memory > locations and tell me when they're all finished." I'm trying to make a > single query concurrent by using the knowledge of a *list* of accesses to > be made, and getting the operating system to do all of them concurrently. There are two interfaces of which I'm aware of. posix_fadvise() which just tells the kernel you'll be needing the blocks soon. Linux at least initiates I/O on them into cache. libaio which lets you specify the location to read the blocks and how to notify you when they're ready. On Linux posix_fadvise works great but libaio doesn't seem to gain any speed bonus, at least on 2.6.22 with glibc 2.6.1. I was under the impression there was a kernel implementation somewhere but apparently it's not helping. On Solaris apparently it doesn't have posix_fadvise but libaio works great. We could use libaio as a kind of backdoor fadvise where we just initiate i/o on the block but throw away the results assuming they'll stay in cache for the real read or we could add an asynchronous interface to the buffer manager. The latter is attractive but would be a much more invasive patch. I'm inclined to start with the former. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support! ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| ||||
| Matthew wrote: > On Tue, 4 Dec 2007, Gregory Stark wrote: > >> Also, it's true, you need to preread more than 12 blocks to handle a 12-disk >> raid. My offhand combinatorics analysis seems to indicate you would expect to >> need to n(n-1)/2 blocks on average before you've hit all the blocks. There's >> little penalty to prereading unless you use up too much kernel resources or >> you do unnecessary i/o which you never use, so I would expect doing n^2 capped >> at some reasonable number like 1,000 pages (enough to handle a 32-disk raid) >> would be reasonable. >> > It's better than that, actually. Let's assume a RAID 0 striped set of > twelve discs. If you spread out twelve requests to twelve discs, then the > expected number of requests to each disc is one. The probablility that any > single disc receives more than say three requests is rather small. As you > increase the number of requests, the longest reasonably-expected queue > length for a particular disc gets closer to the number of requests divided > by the number of discs, as the requests get spread more and more evenly > among the discs. > > The larger the set of requests, the closer the performance will scale to > the number of discs This assumes that you can know which pages to fetch ahead of time - which you do not except for sequential read of a single table. I think it would be possible that your particular case could be up to 6X faster, but that most other people will see little or no speed up. If it does read the wrong pages, it is wasting it's time. I am not trying to discourage you - only trying to ensure that you have reasonable expectations. 12X is far too optimistic. Please show one of your query plans and how you as a person would design which pages to request reads for. Cheers, mark -- Mark Mielke <mark@mielke.cc> |