This is a discussion on RAID arrays and performance within the Pgsql Performance forums, part of the PostgreSQL category; --> Mark Mielke wrote: > At a minimum, this breaks your query into: 1) Preload all the index > pages ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| Mark Mielke wrote: > At a minimum, this breaks your query into: 1) Preload all the index > pages you will need Isn't this fairly predictable - the planner has chosen the index so it will be operating on a bounded subset. > , 2) Scan the index pages you needed Yes, and AIO helps when you can scan them in arbitrary order, as they are returned. > , 3) Preload all the table page you will need No - just request that they load. You can do work as soon as any page is returned. > , 4) Scan the table pages you needed. In the order that is most naturally returned by the disks. > But do you really need the whole index? I don't think I suggested that. > What if you only need parts of the index, and this plan now reads the > whole index using async I/O "just in case" it is useful? Where did you get that from? > index scan into a regular B-Tree scan, which is difficult to perform > async I/O for, as you don't necessarily know which pages to read next. > Most B-trees are not so deep. It would generally be a win to retain interior nodes of indices in shared memory, even if leaf pages are not present. In such a case, it is quite quick to establish which leaf pages must be demand loaded. I'm not suggesting that Postgres indices are structured in a way that would support this sort of thing now. James ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| James Mansion wrote: > Mark Mielke wrote: >> At a minimum, this breaks your query into: 1) Preload all the index >> pages you will need > Isn't this fairly predictable - the planner has chosen the index so it > will be operating > on a bounded subset. What is the bounded subset? It is bounded by the value. What value? You need to access the first page before you know what the second page is. PostgreSQL or the kernel should already have the hottest pages in memory, so the value of doing async I/O is very likely the cooler pages that are unique to the query. We don't know what the cooler pages are until we follow three tree down. >> , 2) Scan the index pages you needed > Yes, and AIO helps when you can scan them in arbitrary order, as they > are returned. I don't think you are talking about searching a B-Tree, as the order is important when searching, and search performance would be reduced if one reads and scans more pages than necessary to map from the value to the row. I presume you are talking about scanning the entire index. Where "needed" means "all". Again, this only benefits a subset of the queries. >> , 3) Preload all the table page you will need > No - just request that they load. You can do work as soon as any page > is returned. The difference between preload and handling async I/O in terms of performance is debatable. Greg reports that async I/O on Linux sucks, but posix_fadvise*() has substantial benefits. posix_fadvise*() is preload not async I/O (he also reported that async I/O on Solaris seems to work well). Being able to do work as the first page is available is a micro-optimization as far as I am concerned at this point (one that may not yet work on Linux), as the real benefit comes from utilizing all 12 disks in Matthew's case, not from guaranteeing that data is processed as soon as possible. >> , 4) Scan the table pages you needed. > In the order that is most naturally returned by the disks. Micro-optimization. >> But do you really need the whole index? > I don't think I suggested that. >> What if you only need parts of the index, and this plan now reads the >> whole index using async I/O "just in case" it is useful? > Where did you get that from? I get it from your presumption that you can know which pages of the index to load in advance. The only way you can know which pages must be loaded, is to know that you need to query them all. Unless you have some way of speculating with some degree of accuracy which colder pages in the index you will need, without looking at the index? >> index scan into a regular B-Tree scan, which is difficult to perform >> async I/O for, as you don't necessarily know which pages to read next. > Most B-trees are not so deep. It would generally be a win to retain > interior nodes of indices in > shared memory, even if leaf pages are not present. In such a case, it > is quite quick to establish > which leaf pages must be demand loaded. This is bogus. The less deep the B-Tree is, the less there should be any requirement for async I/O. Hot index pages will be in cache. > I'm not suggesting that Postgres indices are structured in a way that > would support this sort > of thing now. In your hand waving, you have assumed that PostgreSQL B-Tree index might need to be replaced? :-) Cheers, mark -- Mark Mielke <mark@mielke.cc> ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings |
| |||
| Mark Mielke wrote: > PostgreSQL or the kernel should already have the hottest pages in > memory, so the value of doing async I/O is very likely the cooler > pages that are unique to the query. We don't know what the cooler > pages are until we follow three tree down. > I'm assuming that at the time we start to search the index, we have some idea of value or values that we are looking for. Or, as you say, we are applying a function to 'all of it'. Think of a 'between' query. The subset of the index that can be a match can be bounded by the leaf pages that contain the end point(s). Similarly if we have a merge with a sorted intermediate set from a prior step then we also have bounds on the values. I'm not convinced that your assertion that the index leaf pages must necessarily be processed in-order is true - it depends what sort of operation is under way. I am assuming that we try hard to keep interior index nodes and data in meory and that having identified the subset of these that we want, we can immediately infer the set of leaves that are potentially of interest. > The difference between preload and handling async I/O in terms of > performance is debatable. Greg reports that async I/O on Linux sucks, > but posix_fadvise*() has substantial benefits. posix_fadvise*() is > preload not async I/O (he also reported that async I/O on Solaris > seems to work well). Being able to do work as the first page is > available is a micro-optimization as far as I am concerned at this > point (one that may not yet work on Linux), as the real benefit comes > from utilizing all 12 disks in Matthew's case, not from guaranteeing > that data is processed as soon as possible. > I see it as part of the same problem. You can partition the data across all the disks and run queries in parallel against the partitions, or you can lay out the data in the RAID array in which case the optimiser has very little idea how the data will map to physical data layout - so its best bet is to let the systems that DO know decide the access strategy. And those systems can only do that if you give them a lot of requests that CAN be reordered, so they can choose a good ordering. > Micro-optimization. > Well, you like to assert this - but why? If a concern is the latency (and my experience suggests that latency is the biggest issue in practice, not throughput per se) then overlapping processing while waiting for 'distant' data is important - and we don't have any information about the physical layout of the data that allows us to assert that forward access pre-read of data from a file is the right strategy for accessing it as fast as possible - we have to allow the OS (and to an increasing extent the disks) to manage the elevator IO to best effect. Its clear that the speed of streaming read and write of modern disks is really high compared to that of random access, so anything we can do to help the disks run in that mode is pretty worthwhile even if the physical streaming doesn't match any obvious logical ordering of the OS files or logical data pages within them. If you have a function to apply to a set of data elements and the elements are independant, then requiring that the function is applied in an order rather than conceptually in parallel is going to put a lot of constraint on how the hardware can optimise it. Clearly a hint to preload is better than nothing. But it seems to me that the worst case is that we wait for the slowest page to load and then start processing hoping that the rest of the data stays in the buffer cache and is 'instant', while AIO and evaluate-when-ready means that process is still bound by the slowest data to arrive, but at that point there is little processing still to do, and the already-processed buffers can be reused earlier. In the case where there is significant presure on the buffer cache, this can be significant. Of course, a couple of decades bullying Sybase systems on Sun Enterprise boxes may have left me somewhat jaundiced - but Sybase can at least parallelise things. Sometimes. When it does, its quite a big win. > In your hand waving, you have assumed that PostgreSQL B-Tree index > might need to be replaced? :-) > Sure, if the intent is to make the system thread-hot or AIO-hot, then the change is potentially very invasive. The strategy to evaluate queries based on parallel execution and async IO is not necessarily very like a strategy where you delegate to the OS buffer cache. I'm not too bothered for the urpose of this discussion whether the way that postgres currently navigates indexes is amenable to this. This is bikeshed land, right? I think it is foolish to disregard strategies that will allow overlapping IO and processing - and we want to keep disks reading and writing rather than seeking. To me that suggests AIO and disk-native queuing are quite a big deal. And parallel evaluation will be too as the number of cores goes up and there is an expectation that this should reduce latency of individual query, not just allow throughput with lots of concurrent demand. ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq |
| |||
| James Mansion wrote: > Mark Mielke wrote: >> PostgreSQL or the kernel should already have the hottest pages in >> memory, so the value of doing async I/O is very likely the cooler >> pages that are unique to the query. We don't know what the cooler >> pages are until we follow three tree down. >> > I'm assuming that at the time we start to search the index, we have > some idea of value or values that > we are looking for. Or, as you say, we are applying a function to > 'all of it'. > > Think of a 'between' query. The subset of the index that can be a > match can be bounded by the leaf > pages that contain the end point(s). Similarly if we have a merge > with a sorted intermediate set from > a prior step then we also have bounds on the values. How do you find the bounding points for these pages? Does this not require descending through the tree in a more traditional way? > I'm not convinced that your assertion that the index leaf pages must > necessarily be processed in-order > is true - it depends what sort of operation is under way. I am > assuming that we try hard to keep > interior index nodes and data in meory and that having identified the > subset of these that we want, we > can immediately infer the set of leaves that are potentially of interest. It is because you are missing my point. In order to find a reduced set of pages to load, one must descend through the B-Tree. Identifying the subset requires sequential loading of pages. There is some theoretical potential for async I/O, but I doubt your own assertion that this is significant in any significant way. I ask you again - how do you know which lower level pages to load before you load the higher level pages? The only time the B-Tree can be processed out of order in this regard is if you are doing a bitmap index scan or some similar operation that will scan the entire tree, and you do not care what order the pages arrive in. If you are looking for a specific key, one parent page leads to one child page, and the operation is sequential. >> The difference between preload and handling async I/O in terms of >> performance is debatable. Greg reports that async I/O on Linux sucks, >> but posix_fadvise*() has substantial benefits. posix_fadvise*() is >> preload not async I/O (he also reported that async I/O on Solaris >> seems to work well). Being able to do work as the first page is >> available is a micro-optimization as far as I am concerned at this >> point (one that may not yet work on Linux), as the real benefit comes >> from utilizing all 12 disks in Matthew's case, not from guaranteeing >> that data is processed as soon as possible. > I see it as part of the same problem. You can partition the data > across all the disks and run queries in parallel > against the partitions, or you can lay out the data in the RAID array > in which case the optimiser has very little idea > how the data will map to physical data layout - so its best bet is to > let the systems that DO know decide the > access strategy. And those systems can only do that if you give them > a lot of requests that CAN be reordered, > so they can choose a good ordering. You can see it however you like - what remains, is that the 12X speed up is going to come from using 12 disks, and loading the 12 disks in parallel. More theoretical improvements with regard to the ability for a particular hard disk to schedule reads and return results out of order, have not, in my reading, been shown to reliably improve performance to a significant degree. Unfortunately, Linux doesn't support my SATA NCQ yet, so I haven't been able to experiment myself. Gregory provided numbers showing a 2X - 3X performance when using three disks. This has the potential for significant improvement with real hardware, modest cost, and perhaps trivial changes to PostgreSQL. What you describe is a re-design of the I/O strategy that will be designed for asynchronous I/O, with some sort of state machine that will be able to deal efficiently with either index pages or table pages out of order. Do you have numbers to show that such a significant change would result in a reasonable return on the investment? >> Micro-optimization. > Well, you like to assert this - but why? I'll quote from: http://en.wikipedia.org/wiki/Native_Command_Queuing Most reading I have done shows NCQ to have limited gains, with most benchmarks being suspect. Also, latency is only for the first page. One presumes that asynch I/O will be mostly valuable in the case where many pages can be scheduled for reading at the same time. In the case that many pages are scheduled for reading, the first page will be eventually served, and the overall bandwidth is still the same. In your theoretical model, you are presuming the CPU is a bottleneck, either for processing, or scheduling the next batch of reads. I think you are hand waving, and given that Linux doesn't yet support asynch I/O well, Gregory's model will serve more PostgreSQL users than yours with less complexity. > Clearly a hint to preload is better than nothing. But it seems to me > that the worst case is that we wait for > the slowest page to load and then start processing hoping that the > rest of the data stays in the buffer cache > and is 'instant', while AIO and evaluate-when-ready means that process > is still bound by the slowest > data to arrive, but at that point there is little processing still to > do, and the already-processed buffers can be > reused earlier. In the case where there is significant presure on the > buffer cache, this can be significant. It seems to me that you have yet to prove that there will be substantial gains in overall performance for preload. Leaping on to presuming that PostgreSQL should switch to a fully asynch I/O model seems a greater stretch. By the sounds of it, Gregory could have the first implemented very soon. When will you have yours? :-) >> In your hand waving, you have assumed that PostgreSQL B-Tree index >> might need to be replaced? :-) > Sure, if the intent is to make the system thread-hot or AIO-hot, then > the change is potentially very > invasive. The strategy to evaluate queries based on parallel > execution and async IO is not necessarily > very like a strategy where you delegate to the OS buffer cache. > > I'm not too bothered for the urpose of this discussion whether the way > that postgres currently > navigates indexes is amenable to this. This is bikeshed land, right? I am only interested by juicy projects that have a hope of success. This subject does interest me - I am hoping my devil's advocate participation encourages people to seek a practical implementation that will benefit me. > I think it is foolish to disregard strategies that will allow > overlapping IO and processing - and we want to > keep disks reading and writing rather than seeking. To me that > suggests AIO and disk-native queuing > are quite a big deal. And parallel evaluation will be too as the > number of cores goes up and there is > an expectation that this should reduce latency of individual query, > not just allow throughput with lots > of concurrent demand. I am more amenable to multi-threaded index processing for the same query than async I/O to take advantage of NCQ. Guess we each come from a different background. :-) Cheers, mark -- Mark Mielke <mark@mielke.cc> ---------------------------(end of broadcast)--------------------------- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate |
| |||
| On Tue, 4 Dec 2007, Mark Mielke wrote: >> This is bikeshed land, right? > I am only interested by juicy projects that have a hope of success. This > subject does interest me - I am hoping my devil's advocate participation > encourages people to seek a practical implementation that will benefit me. Nah, you're both in bikeshed land. Ultimately this feature is going to get built out in a way that prioritizes as much portability as is possible while minimizing the impact on existing code. The stated PostgreSQL bias is that async-I/O is not worth the trouble until proven otherwise: http://www.postgresql.org/docs/faqs.....html#item1.14 That's why Greg Stark is busy with low-level tests of I/O options while you're arguing much higher level details. Until you've got some benchmarks in this specific area to go by, you can talk theory all day and not impact what implementation will actually get built one bit. As an example of an area you've already brought up where theory and benchmarks clash violently, the actual situation with NCQ on SATA drives has been heavily blurred because of how shoddy some manufacturer's NCQ implementation are. Take a look at the interesting thread on this topic at http://lkml.org/lkml/2007/4/3/159 to see what I'm talking about. Even if you have an NCQ drive, and a version of Linux that supports it, and you've setup everything up right, you can still end up with unexpected performance regressions under some circumstances. It's still the wild west with that technology. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD ---------------------------(end of broadcast)--------------------------- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate |
| |||
| Mark Mielke wrote: > Asynchronous I/O is no more a magic bullet than threading. It requires a > lot of work to get it right, and if one gets it wrong, it can be slower > than the regular I/O or single threaded scenarios. Both look sexy on > paper, neither may be the solution to your problem. Or they may be. We > wouldn't know without numbers. Agreed. We currently don't use multiple CPUs or multiple disks efficiently for single-query loads. There is certainly more we could do in these areas, and it is on the TODO list. The good news is that most work loads are multi-user and we use resources more evenly in those cases. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://postgres.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---------------------------(end of broadcast)--------------------------- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate |
| |||
| On Thu, 18 Sep 2008, Matthew Wakeling wrote: > On Tue, 29 Jan 2008, Gregory Stark wrote: >> I have a patch which implements it for the low hanging fruit of bitmap >> index scans. it does it using an extra trip through the buffer manager >> which is the least invasive approach but not necessarily the best. > > Gregory - what's the status of that patch at the moment? Will it be making it > into a new version of Postgres, or are we waiting for it to be implemented > fully? It and a related fadvise patch have been floating around the project queue for a while now. I just sorted through all the various patches and mesasges related to this area and updated the list at http://wiki.postgresql.org/wiki/Comm...ending_patches recently, I'm trying to kick back a broader reviewed version of this concept right now. > It's just that our system is doing a lot of bitmap index scans at the moment, > and it'd help to be able to spread them across the 16 discs in the RAID > array. It's the bottleneck in our system at the moment. If you have some specific bitmap index scan test case suggestions you can pass along (either publicly or in private to me, I can probably help anonymize them), that's one of the things that has been holding this up. Alternately, if you'd like to join in on testing this all out more help would certainly be welcome. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance |
| |||
| On Thu, Sep 18, 2008 at 1:30 PM, Greg Smith <gsmith@gregsmith.com> wrote: > If you have some specific bitmap index scan test case suggestions you can > pass along (either publicly or in private to me, I can probably help > anonymize them), that's one of the things that has been holding this up. > Alternately, if you'd like to join in on testing this all out more help > would certainly be welcome. I posted in pgsql-perform about a problem that's using bitmap heap scans that's really slow compared to just using nested-loops. Don't know if that is relevant or not. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance |
| |||
| On Thu, 18 Sep 2008, Greg Smith wrote: >> It's just that our system is doing a lot of bitmap index scans at the >> moment, and it'd help to be able to spread them across the 16 discs in the >> RAID array. It's the bottleneck in our system at the moment. > > If you have some specific bitmap index scan test case suggestions you can > pass along (either publicly or in private to me, I can probably help > anonymize them), that's one of the things that has been holding this up. Okay, here's a description of what we're doing. We are importing data from a large file (so we don't have a choice on the order that the data comes in). For each entry in the source, we have to look up the corresponding row in the database, and issue the correct "UPDATE" command according to what's in the database and what's in the data source. Long story - it isn't feasible to change the overall process. In order to improve the performance, I made the system look ahead in the source, in groups of a thousand entries, so instead of running: SELECT * FROM table WHERE field = 'something'; a thousand times, we now run: SELECT * FROM table WHERE field IN ('something', 'something else'...); with a thousand things in the IN. Very simple query. It does run faster than the individual queries, but it still takes quite a while. Here is an example query: SELECT a1_.id AS a1_id, a1_.primaryIdentifier AS a2_ FROM Gene AS a1_ WHERE a1_.primaryIdentifier IN ('SPAC11D3.15', 'SPAC11D3.16c', 'SPAC11D3.17', 'SPAC11D3.18c', 'SPAC15F9.01c', 'SPAC15F9.02', 'SPAC16.01', 'SPAC18G6.01c', 'SPAC18G6.02c', 'SPAC18G6.04c', 'SPAC18G6.05c', 'SPAC18G6.06', 'SPAC18G6.07c', 'SPAC18G6.09c', 'SPAC18G6.10', 'SPAC18G6.11c', 'SPAC18G6.12c', 'SPAC18G6.13', 'SPAC18G6.14c', 'SPAC18G6.15', 'SPAC1B9.01c', 'SPAC1D4.02c', 'SPAC1D4.03c', 'SPAC1D4.04', 'SPAC1D4.05c', 'SPAC1D4.07c', 'SPAC1D4.08', 'SPAC1D4.09c', 'SPAC1D4.10', 'SPAC1D4.11c', 'SPAC1F3.11', 'SPAC23A1.10', 'SPAC23E2.01', 'SPAC23E2.02', 'SPAC23E2.03c', 'SPAC26A3.02', 'SPAC26A3.03c', 'SPAC26A3.05', 'SPAC26A3.06', 'SPAC26A3.07c', 'SPAC26A3.08', 'SPAC26A3.09c', 'SPAC26A3.10', 'SPAC26A3.11', 'SPAC26A3.14c', 'SPAC26A3.15c', 'SPAC26A3.16', 'SPAC27F1.01c', 'SPAC27F1.03c', 'SPAC27F1.04c', 'SPAC27F1.05c', 'SPAC27F1.06c', 'SPAC3H8.02', 'SPAC3H8.03', 'SPAC3H8.04', 'SPAC3H8.05c', 'SPAC3H8.06', 'SPAC3H8.07c', 'SPAC3H8.08c', 'SPAC3H8.09c', 'SPAC3H8.10', 'SPAC3H8.11', 'SPAC8E11.11', 'SPBC106.15', 'SPBC17G9.10', 'SPBC24E9.15c', 'WBGene00000969', 'WBGene00003035', 'WBGene00004095', 'WBGene00016011', 'WBGene00018672', 'WBGene00018674', 'WBGene00018675', 'WBGene00018676', 'WBGene00018959', 'WBGene00018960', 'WBGene00018961', 'WBGene00023407') ORDER BY a1_.id LIMIT 2000; And the corresponding EXPLAIN ANALYSE: Limit (cost=331.28..331.47 rows=77 width=17) (actual time=121.973..122.501 rows=78 loops=1) -> Sort (cost=331.28..331.47 rows=77 width=17) (actual time=121.968..122.152 rows=78 loops=1) Sort Key: id Sort Method: quicksort Memory: 29kB -> Bitmap Heap Scan on gene a1_ (cost=174.24..328.87 rows=77 width=17) (actual time=114.311..121.705 rows=78 loops=1) Recheck Cond: (primaryidentifier = ANY ('{SPAC11D3.15... -> Bitmap Index Scan on gene__key_primaryidentifier (cost=0.00..174.22 rows=77 width=0) (actual time=44.434..44.434 rows=150 loops=1) Index Cond: (primaryidentifier = ANY ('{SPAC11D3.15,SPAC11D3.16c... Total runtime: 122.733 ms (9 rows) Although it's probably in the cache, as it took 1073 ms the first time. The table has half a million rows, but tables all over the database are being accessed, so the cache is shared between several hundred million rows. Postgres executes this query in two stages. First it does a trawl of the index (on field), and builds an bitmap. Then it fetches the pages according to the bitmap. I can see the second stage being quite easy to adapt for fadvise, but the first stage would be a little more tricky. Both stages are equally important, as they take a comparable amount of time. We are running this database on a 16-spindle RAID array, so the benefits to our process of fully utilising them would be quite large. I'm considering if I can parallelise things a little though. > Alternately, if you'd like to join in on testing this all out more help would > certainly be welcome. How would you like me to help? Matthew -- What goes up must come down. Ask any system administrator. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance |
| ||||
| Matthew Wakeling <matthew@flymine.org> writes: > In order to improve the performance, I made the system look ahead in the > source, in groups of a thousand entries, so instead of running: > SELECT * FROM table WHERE field = 'something'; > a thousand times, we now run: > SELECT * FROM table WHERE field IN ('something', 'something else'...); > with a thousand things in the IN. Very simple query. It does run faster > than the individual queries, but it still takes quite a while. Here is an > example query: Your example shows the IN-list as being sorted, but I wonder whether you actually are sorting the items in practice? If not, you might try that to improve locality of access to the index. Also, parsing/planning time could be part of your problem here with 1000 things to look at. Can you adjust your client code to use a prepared query? I'd try SELECT * FROM table WHERE field = ANY($1::text[]) (or whatever the field datatype actually is) and then push the list over as a single parameter value using array syntax. You might find that it scales to much larger IN-lists that way. regards, tom lane -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance |