This is a discussion on RAID arrays and performance within the Pgsql Performance forums, part of the PostgreSQL category; --> Matthew wrote: > On Tue, 4 Dec 2007, Mark Mielke wrote: > >> The bitmap scan method does ordered ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| Matthew wrote: > On Tue, 4 Dec 2007, Mark Mielke wrote: > >> The bitmap scan method does ordered reads of the table, which can >> partially take advantage of sequential reads. Not sure whether bitmap >> scan is optimal for your situation or whether your situation would allow >> this to be taken advantage of. >> > Bitmap scan may help where several randomly-accessed pages are next to > each other. However, I would expect the operating system's readahead and > buffer cache to take care of that one. > The disk head has less theoretical distance to travel if always moving in a single direction instead of randomly seeking back and forth. >> Do you know that there is a problem, or are you speculating about one? I >> think your case would be far more compelling if you could show a >> problem. :-) >> >> I would think that at a minimum, having 12 disks with RAID 0 or RAID 1+0 >> would allow your insane queries to run concurrent with up to 12 other >> queries. >> > Yeah, I don't really care about concurrency. It's pretty obvious that > running x concurrent queries on an x-disc RAID system will allow the > utilisation of all the discs at once, therefore allowing the performance > to scale by x. What I'm talking about is a single query running on an > x-disc RAID array. > The time to seek to a particular sector does not reduce 12X with 12 disks. It is still approximately the same, only it can handle 12X the concurrency. This makes RAID best for concurrent loads. In your scenario, you are trying to make a single query take advantage of this concurrency by taking advantage of the system cache. I don't think full 12X concurrency of a single query is possible for most loads, probably including yours. See later for details. >> I recall talk of more intelligent table scanning algorithms, and the use >> of asynchronous I/O to benefit from RAID arrays, but the numbers >> prepared to convince people that the change would have effect have been >> less than impressive. >> > I think a twelve-times speed increase is impressive. Granted, given > greatly-concurrent access, the benefits go away, but I think it'd be worth > it for when there are few queries running on the system. > > I don't think you would have to create a more intelligent table scanning > algorithm. What you would need to do is take the results of the index, > convert that to a list of page fetches, then pass that list to the OS as > an asynchronous "please fetch all these into the buffer cache" request, > then do the normal algorithm as is currently done. The requests would then > come out of the cache instead of from the disc. Of course, this is from a > simple Java programmer who doesn't know the OS interfaces for this sort of > thing. > That's about how the talk went. :-) The problem is that a 12X speed for 12 disks seems unlikely except under very specific loads (such as a sequential scan of a single table). Each of the indexes may need to be scanned or searched in turn, then each of the tables would need to be scanned or searched in turn, depending on the query plan. There is no guarantee that the index rows or the table rows are equally spread across the 12 disks. CPU processing becomes involved with is currently limited to a single processor thread. I suspect no database would achieve a 12X speedup for 12 disks unless a simple sequential scan of a single table was required, in which case the reads could be fully parallelized with RAID 0 using standard sequential reads, and this is available today using built-in OS or disk read-ahead. Cheers, mark -- Mark Mielke <mark@mielke.cc> ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| |||
| matthew@flymine.org wrote: > So, if you hand requests one by one to the disc, it will almost always be > faster to order them. On the other hand, if you hand a huge long list of > requests to a decent SCSI or SATA-NCQ disc in one go, it will reorder the > reads itself, and it will do it much better than you. > > Sure - but this doesn't suggest threading so much as pushing all reads into AIO as soon as they can be identified - and hoping that your os has a decent AIO subsystem, which is sadly a tall order for many *nix systems. I do think some thought should be given to threading but I would expect the effect to be more noticeable for updates where you update tables that have multiple indices. In the case of your scan then you need threading on CPU (rather than concurrent IO through AIO) if the disks can feed you data faster than you can scan it. Which might be the case for some scans using user functions, but I wouldn't have thought it would be the case for a sinple index scan. At some point, hopefully, the engines will be: a) thread safe (if not thread hot) so it can exist with threaded user functions and embedded languages b) able to incorporate C++ add-in functionality There may not be a pressing reason to support either of these, but having a capability to experiment would surely be helpful and allow incremental enhancement - so baby steps could be made to (for example) move stats and logging to a background thread, move push of results to clients out of the query evaluator thread, and so on. Parallel threading queries is a whle different ball game which needs thought in the optimiser. James ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| Mark Mielke wrote: > This assumes that you can know which pages to fetch ahead of time - > which you do not except for sequential read of a single table. > Why doesn't it help to issue IO ahead-of-time requests when you are scanning an index? You can read-ahead in index pages, and submit requests for data pages as soon as it is clear you'll want them. Doing so can allow the disks and OS to relax the order in which you receive them, which may allow you to process them while IO continues, and it may also optimise away some seeking and settle time. Maybe. ---------------------------(end of broadcast)--------------------------- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate |
| |||
| On Tue, 4 Dec 2007, Mark Mielke wrote: > > The larger the set of requests, the closer the performance will scale to > > the number of discs > > This assumes that you can know which pages to fetch ahead of time - > which you do not except for sequential read of a single table. There are circumstances where it may be hard to locate all the pages ahead of time - that's probably when you're doing a nested loop join. However, if you're looking up in an index and get a thousand row hits in the index, then there you go. Page locations to load. > Please show one of your query plans and how you as a person would design > which pages to request reads for. How about the query that "cluster <skrald@amossen.dk>" was trying to get to run faster a few days ago? Tom Lane wrote about it: | Wouldn't help, because the accesses to "questions" are not the problem. | The query's spending nearly all its time in the scan of "posts", and | I'm wondering why --- doesn't seem like it should take 6400msec to fetch | 646 rows, unless perhaps the data is just horribly misordered relative | to the index. Which is exactly what's going on. The disc is having to seek 646 times fetching a single row each time, and that takes 6400ms. He obviously has a standard 5,400 or 7,200 rpm drive with a seek time around 10ms. Or on a similar vein, fill a table with completely random values, say ten million rows with a column containing integer values ranging from zero to ten thousand. Create an index on that column, analyse it. Then pick a number between zero and ten thousand, and "SELECT * FROM table WHERE that_column = the_number_you_picked" Matthew -- Experience is what allows you to recognise a mistake the second time you make it. ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| James Mansion wrote: > Mark Mielke wrote: >> This assumes that you can know which pages to fetch ahead of time - >> which you do not except for sequential read of a single table. > Why doesn't it help to issue IO ahead-of-time requests when you are > scanning an index? You can read-ahead > in index pages, and submit requests for data pages as soon as it is > clear you'll want them. Doing so can allow > the disks and OS to relax the order in which you receive them, which > may allow you to process them while IO > continues, and it may also optimise away some seeking and settle > time. Maybe. Sorry to be unclear. To achieve a massive speedup (12X for 12 disks with RAID 0) requires that you know what reads to perform in advance. The moment you do not, you only have a starting point, your operations begin to serialize again. For example, you must scan the first index, to be able to know what table rows to read. At a minimum, this breaks your query into: 1) Preload all the index pages you will need, 2) Scan the index pages you needed, 3) Preload all the table page you will need, 4) Scan the table pages you needed. But do you really need the whole index? What if you only need parts of the index, and this plan now reads the whole index using async I/O "just in case" it is useful? Index is a B-Tree for a reason. In Matthew's case where he has an IN clause with thousands of possibles (I think?), perhaps a complete index scan is always the best case - but that's only one use case, and in my opinion, an obscure one. As soon as additional table joins become involved, the chance that whole index scans are required would probably normally reduce, which turns the index scan into a regular B-Tree scan, which is difficult to perform async I/O for, as you don't necessarily know which pages to read next. It seems like a valuable goal - but throwing imaginary numbers around does not appeal to me. I am more interested in Gregory's simulations. I would like to understand his simulation better, and see his results. Speculation about amazing potential is barely worth the words used to express it. The real work is in design and implementation. :-) Cheers, mark -- Mark Mielke <mark@mielke.cc> ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| Matthew wrote: > On Tue, 4 Dec 2007, Mark Mielke wrote: > >>> The larger the set of requests, the closer the performance will scale to >>> the number of discs >>> >> This assumes that you can know which pages to fetch ahead of time - >> which you do not except for sequential read of a single table. >> > There are circumstances where it may be hard to locate all the pages ahead > of time - that's probably when you're doing a nested loop join. However, > if you're looking up in an index and get a thousand row hits in the index, > then there you go. Page locations to load. > Sure. >> Please show one of your query plans and how you as a person would design >> which pages to request reads for. >> > How about the query that "cluster <skrald@amossen.dk>" was trying to get > to run faster a few days ago? Tom Lane wrote about it: > > | Wouldn't help, because the accesses to "questions" are not the problem. > | The query's spending nearly all its time in the scan of "posts", and > | I'm wondering why --- doesn't seem like it should take 6400msec to fetch > | 646 rows, unless perhaps the data is just horribly misordered relative > | to the index. > > Which is exactly what's going on. The disc is having to seek 646 times > fetching a single row each time, and that takes 6400ms. He obviously has a > standard 5,400 or 7,200 rpm drive with a seek time around 10ms. > Your proposal would not necessarily improve his case unless he also purchased additional disks, at which point his execution time may be different. More speculation. :-) It seems reasonable - but still a guess. > Or on a similar vein, fill a table with completely random values, say ten > million rows with a column containing integer values ranging from zero to > ten thousand. Create an index on that column, analyse it. Then pick a > number between zero and ten thousand, and > > "SELECT * FROM table WHERE that_column = the_number_you_picked This isn't a real use case. Optimizing for the worst case scenario is not always valuable. Cheers, mark -- Mark Mielke <mark@mielke.cc> |
| |||
| On Tue, 4 Dec 2007, Gregory Stark wrote: > Fwiw, what made you bring up this topic now? You're the second person in about > two days to bring up precisely this issue and it was an issue I had been > planning to bring up on -hackers as it was. I only just joined the performance mailing list to talk about R-trees. I would probably have brought it up earlier if I had been here earlier. However, we're thinking of buying this large machine, and that reminded me. I have been biting at the bit for my bosses to allow me time to write an indexing system for transient data - a lookup table backed by disc, looking up from an integer to get an object, native in Java. Our system often needs to fetch a list of a thousand different objects by a key like that, and Postgres just doesn't do that exact thing fast. I was going to implement it with full asynchronous IO, to do that particular job very fast, so I have done a reasonable amount of research into the topic. In Java, that is. It would add a little bit more performance for our system. That wouldn't cover us - we still need to do complex queries with the same problem, and that'll have to stay in Postgres. Matthew -- The early bird gets the worm. If you want something else for breakfast, get up later. ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| Matthew wrote: > On Tue, 4 Dec 2007, Gregory Stark wrote: > >> Fwiw, what made you bring up this topic now? You're the second person in about >> two days to bring up precisely this issue and it was an issue I had been >> planning to bring up on -hackers as it was. >> > I only just joined the performance mailing list to talk about R-trees. I > would probably have brought it up earlier if I had been here earlier. > However, we're thinking of buying this large machine, and that reminded > me. > > I have been biting at the bit for my bosses to allow me time to write an > indexing system for transient data - a lookup table backed by disc, > looking up from an integer to get an object, native in Java. Our system > often needs to fetch a list of a thousand different objects by a key like > that, and Postgres just doesn't do that exact thing fast. I was going to > implement it with full asynchronous IO, to do that particular job very > fast, so I have done a reasonable amount of research into the topic. In > Java, that is. It would add a little bit more performance for our system. > That wouldn't cover us - we still need to do complex queries with the same > problem, and that'll have to stay in Postgres So much excitement and zeal - refreshing to see. And yet, no numbers! :-) You describe a new asynchronous I/O system to map integers to Java objects above. Why would you write this? Have you tried BerkeleyDB or BerkeleyDB JE? BerkeleyDB with BDB as a backend or their full Java backend gives you a Java persistence API that will allow you to map Java objects (including integers) to other Java objects. They use generated Java run time instructions instead of reflection to store and lock your Java objects. If it came to a bet, I would bet that their research and tuning over several years, and many people, would beat your initial implementation, asynchronous I/O or not. Asynchronous I/O is no more a magic bullet than threading. It requires a lot of work to get it right, and if one gets it wrong, it can be slower than the regular I/O or single threaded scenarios. Both look sexy on paper, neither may be the solution to your problem. Or they may be. We wouldn't know without numbers. Cheers, mark -- Mark Mielke <mark@mielke.cc> |
| |||
| On Tue, 4 Dec 2007, Mark Mielke wrote: > So much excitement and zeal - refreshing to see. And yet, no numbers! :-) What sort of numbers did you want to see? > You describe a new asynchronous I/O system to map integers to Java > objects above. Why would you write this? Have you tried BerkeleyDB or > BerkeleyDB JE? BerkeleyDB with BDB as a backend or their full Java > backend gives you a Java persistence API that will allow you to map Java > objects (including integers) to other Java objects. Looked at all those. Didn't like their performance characteristics, or their interfaces. It's simply the fact that they're not designed for the workload that we want to put on them. > If it came to a bet, I would bet that their research and > tuning over several years, and many people, would beat your initial > implementation, asynchronous I/O or not. Quite possibly. However, there's the possibility that it wouldn't. And I can improve it - initial implementations often suck. > Asynchronous I/O is no more a magic bullet than threading. It requires a > lot of work to get it right, and if one gets it wrong, it can be slower > than the regular I/O or single threaded scenarios. Both look sexy on > paper, neither may be the solution to your problem. Or they may be. We > wouldn't know without numbers. Okay, numbers. About eight years ago I wrote the shell of a filesystem implementation, concentrating on performance and integrity. It absolutely whooped ext2 for both read and write speed - especially metadata write speed. Anything up to 60 times as fast. I wrote a load of metadata to ext2, which took 599.52 seconds, and on my system took 10.54 seconds. Listing it back (presumably from cache) took 1.92 seconds on ext2 and 0.22 seconds on my system. No silly caching tricks that sacrifice integrity. It's a pity I got nowhere near finishing that system - just enough to prove my point and get a degree, but looking back on it there are massive ways it's rubbish and should be improved. It was an initial implementation. I didn't have reiserfs, jfs, or xfs available at that time, but it would have been really interesting to compare. This is the system I would have based my indexing thing on. Matthew -- Anyone who goes to a psychiatrist ought to have his head examined. ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| ||||
| "Matthew" <matthew@flymine.org> writes: > On Tue, 4 Dec 2007, Mark Mielke wrote: >> So much excitement and zeal - refreshing to see. And yet, no numbers! :-) > > What sort of numbers did you want to see? FWIW I posted some numbers from a synthetic case to pgsql-hackers http://archives.postgresql.org/pgsql...2/msg00088.php Very promising -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's On-Demand Production Tuning ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings |