This is a discussion on Query optimizer 8.0.1 (and 8.0) within the pgsql Hackers forums, part of the PostgreSQL category; --> Here's one: I have the USA census TIGER database loaded, the WHOLE THING, the whole country. It isn't the ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| Here's one: I have the USA census TIGER database loaded, the WHOLE THING, the whole country. It isn't the biggest database, but it is about 40G before indexes. Every table is over a few million rows. I can quite honestly say, a sequential scan is almost never the right thing to do. Info below. Here is the query: select * from rt1, rt2 where rt1.tlid = rt2.tlid and (zipr = 2186 or zipl=2186); Well, when you look at the explain, obviously it looks like the sequential scan is better, but trust me, it takes a good few minutes for the query to respond with sequential scan enabled. I've run analyze and everything. I suspect that analyze only samples a very small amount of the database and gets the wrong idea about it. Is there a way to force analyze to sample more rows? tiger=# analyze verbose rt2 ; INFO: analyzing "public.rt2" INFO: "rt2": scanned 3000 of 1139825 pages, containing 84160 live rows and 0 dead rows; 3000 rows in sample, 31975891 estimated total rows ANALYZE tiger=# analyze verbose rt1 ; INFO: analyzing "public.rt1" INFO: "rt1": scanned 3000 of 1527360 pages, containing 90951 live rows and 0 dead rows; 3000 rows in sample, 46304973 estimated total rows ANALYZE tiger=# explain select * from rt1, rt2 where rt1.tlid = rt2.tlid and (zipr = 2186 or zipl=2186); QUERY PLAN ----------------------------------------------------------------------------------------------------- Hash Join (cost=121978.81..3996190.89 rows=21118 width=520) Hash Cond: ("outer".tlid = "inner".tlid) -> Seq Scan on rt2 (cost=0.00..1459291.36 rows=31946636 width=218) -> Hash (cost=120662.36..120662.36 rows=30579 width=302) -> Index Scan using rt1_zipr, rt1_zipl on rt1 (cost=0.00..120662.36 rows=30579 width=302) Index Cond: ((zipr = 2186) OR (zipl = 2186)) (6 rows) tiger=# set enable_seqscan=no; SET tiger=# explain select * from rt1, rt2 where rt1.tlid = rt2.tlid and (zipr = 2186 or zipl=2186); QUERY PLAN ----------------------------------------------------------------------------------------------- Nested Loop (cost=0.00..46868256.00 rows=21118 width=520) -> Index Scan using rt1_zipr, rt1_zipl on rt1 (cost=0.00..120662.36 rows=30579 width=302) Index Cond: ((zipr = 2186) OR (zipl = 2186)) -> Index Scan using rt2_tlid on rt2 (cost=0.00..1523.80 rows=396 width=218) Index Cond: ("outer".tlid = rt2.tlid) Table "public.rt1" Column | Type | Modifiers -----------+-------------------+----------- tlid | integer | side1 | character varying | source | character varying | fedirp | character varying | fename | character varying | fetype | character varying | fedirs | character varying | cfcc | character varying | fraddl | character varying | toaddl | character varying | fraddr | character varying | toaddr | character varying | friaddl | character varying | toiaddl | character varying | friaddr | character varying | toiaddr | character varying | zipl | integer | zipr | integer | aianhhfpl | character varying | aianhhfpr | character varying | aihhtlil | character varying | aihhtlir | character varying | statel | character varying | stater | character varying | countyl | character varying | countyr | character varying | cousubl | character varying | cousubr | character varying | submcdl | character varying | submcdr | character varying | placel | character varying | placer | character varying | tractl | character varying | tractr | character varying | blockl | character varying | blockr | character varying | frlong | numeric | frlat | numeric | tolong | numeric | tolat | numeric | Indexes: "rt1_fename" btree (fename) "rt1_frlat" btree (frlat) "rt1_frlong" btree (frlong) "rt1_tlid" btree (tlid) "rt1_tolat" btree (tolat) "rt1_tolong" btree (tolong) "rt1_zipl" btree (zipl) "rt1_zipr" btree (zipr) Table "public.rt2" Column | Type | Modifiers --------+---------+----------- tlid | integer | rtsq | integer | long1 | numeric | lat1 | numeric | long2 | numeric | lat2 | numeric | long3 | numeric | lat3 | numeric | long4 | numeric | lat4 | numeric | long5 | numeric | lat5 | numeric | long6 | numeric | lat6 | numeric | long7 | numeric | lat7 | numeric | long8 | numeric | lat8 | numeric | long9 | numeric | lat9 | numeric | long10 | numeric | lat10 | numeric | Indexes: "rt2_tlid" btree (tlid) ---------------------------(end of broadcast)--------------------------- TIP 6: Have you searched our list archives? http://archives.postgresql.org |
| |||
| pgsql@mohawksoft.com writes: > -> Index Scan using rt1_zipr, rt1_zipl on rt1 (cost=0.00..121893.93 > rows=30835 width=302) > Index Cond: ((zipr = 2186) OR (zipl = 2186)) > zipl | 925 | > zipr | 899 | Those n_distinct values for zipl and zipr seem aberrant --- too low compared to the estimated rowcount you're showing. What are the true figures? Also, how about some EXPLAIN ANALYZEs and not just EXPLAINs? regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 8: explain analyze is your friend |
| |||
| pgsql@mohawksoft.com writes: > I suspect that analyze only samples a very small amount of the database > and gets the wrong idea about it. Is there a way to force analyze to > sample more rows? default_statistics_target. But let's see the pg_stats rows for these columns before assuming that analyze is getting it wrong. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 7: don't forget to increase your free space map settings |
| |||
| > pgsql@mohawksoft.com writes: >> I suspect that analyze only samples a very small amount of the database >> and gets the wrong idea about it. Is there a way to force analyze to >> sample more rows? > > default_statistics_target. But let's see the pg_stats rows for these > columns before assuming that analyze is getting it wrong. Some more info: I did a select count(distinct(tlid)) from rt2, and updated the statistics with the result: tiger=# update pg_statistic set stadistinct = 23656799 where starelid = 17236 and staattnum = 1; UPDATE 1 tiger=# explain select * from rt1, rt2 where rt1.tlid = rt2.tlid and (zipr tiger(# = 2186 or zipl=2186); QUERY PLAN ----------------------------------------------------------------------------------------------- Nested Loop (cost=0.00..425517.95 rows=21315 width=520) -> Index Scan using rt1_zipr, rt1_zipl on rt1 (cost=0.00..121893.93 rows=30835 width=302) Index Cond: ((zipr = 2186) OR (zipl = 2186)) -> Index Scan using rt2_tlid on rt2 (cost=0.00..9.82 rows=2 width=218) Index Cond: ("outer".tlid = rt2.tlid) (5 rows) tiger=# SELECT attname, n_distinct, most_common_vals FROM pg_stats WHERE tablename = 'rt1'; attname | n_distinct | most_common_vals -----------+------------+----------------------------------------------------------------------------------------------------------------------- tlid | -1 | side1 | 1 | {1} source | 9 | {B,J,A,K,L,N,O,M,C} fedirp | 8 | {N,E,S,W,SW,NE,NW,SE} fename | 2590 | {Main,1st,Oak,2nd,9th,"Burlington Northern Santa Fe R",Park,4th,11th,8th} fetype | 26 | {Rd,St,Ave,Dr} fedirs | 9 | {NE,SE,N,E,NW,SW,W,S,O} cfcc | 51 | {A41,H12,H11,F10,A74,A31,H01} fraddl | 767 | {1,101,2,201,301,401,100,701,298,500} toaddl | 805 | {199,399,299,499,98,198,99,1,100,2} fraddr | 765 | {2,1,100,200,400,300,700,101,501,299} toaddr | 772 | {198,398,298,498,99,98,199,101,1,1098} friaddl | 3 | {0,1,2} toiaddl | 3 | {0,1,2} friaddr | 3 | {0} toiaddr | 3 | {0} zipl | 925 | zipr | 899 | aianhhfpl | 42 | aianhhfpr | 43 | aihhtlil | 2 | aihhtlir | 2 | statel | 55 | {48,06,12,37,29,42,17,36,13,39} stater | 55 | {48,06,12,37,29,42,17,36,13,39} countyl | 189 | {005,059,013,029,003,001,017,009,031,043} countyr | 191 | {005,059,013,001,003,029,017,025,031,043} cousubl | 2568 | {92601,91800,90270,90240,90468,90572,91508,91750,6 0000,90324} cousubr | 2598 | {92601,91800,90270,90240,90468,90572,91248,91750,6 0000,90324} submcdl | -1 | submcdr | -1 | placel | 778 | {51000,65000,44000,12000,38000,60000,63460,07000,0 4000,22000} placer | 787 | {51000,65000,12000,44000,60000,07000,38000,55000,6 3460,04000} tractl | 1370 | {950200,950100,950300,000100,970100,960100,980100, 950700,970300,990100} tractr | 1354 | {950200,950100,950300,000100,970100,960100,990100, 950700,980100,970300} blockl | 1050 | {1000,2000,1001,1003,1005,2001,1009,2006,1002,1004 } blockr | 1055 | {1000,2000,1001,1002,1003,1005,1004,1009,2004,2002 } frlong | 134476 | {-120.214657,-113.074100,-106.494480,-103.306945,-100.184470,-100.083614,-99.476994,-98.420248,-97.325498,-93.349865} frlat | 143222 | {27.759896,29.251454,29.898585,30.093247,31.814071 ,31.950913,32.055726,32.377503,32.523607,32.607387 } tolong | 317744 | {-123.330861,-111.673035,-107.596898,-103.164000,-100.945693,-100.080307,-99.576886,-99.492719,-97.743722,-93.870222} tolat | 278079 | {27.493816,27.904316,29.691644,32.731410,33.350429 ,34.490563,35.551053,35.868297,39.139185,40.068098 } (40 rows) ---------------------------(end of broadcast)--------------------------- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) |
| |||
| > pgsql@mohawksoft.com writes: >> -> Index Scan using rt1_zipr, rt1_zipl on rt1 (cost=0.00..121893.93 >> rows=30835 width=302) >> Index Cond: ((zipr = 2186) OR (zipl = 2186)) > >> zipl | 925 | >> zipr | 899 | > > Those n_distinct values for zipl and zipr seem aberrant --- too low > compared to the estimated rowcount you're showing. What are the > true figures? Also, how about some EXPLAIN ANALYZEs and not just > EXPLAINs? > > ^ tiger=# select count(distinct(zipr)), count(distinct(zipl)) from rt1 ; count | count -------+------- 35133 | 35393 (1 row) ---------------------------(end of broadcast)--------------------------- TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org |
| |||
| Short summary: I had the same problem - since the sort order of zip-codes, counties, city names, and states don't match, the optimizer grossly overestimated the number of pages that would be read. I bet doing a CLUSTER by ZIP would solve that particular query, but would break similar queries by county or by city or by state. I think select attname,correlation from pg_stats where tablename = 'rt1'; will show you the same problem I had. My pg_stats numbers are shown below. I had a couple ugly hacks that worked around the problem for myself. Longer. One interesting property of the TIGER data (and most geospatial databases) is that the data for most of the columns are "locally correlated" but not globally. What I mean by that is that even though data for those zip-codes probably only resided in a few disk pages; the data was probably loaded in order of Tiger File number (state,county), so the "correlation" in pg_stats for zip-code was very low. With the low correlation the optimizer couldn't see the fact that any given zip code's data was all together on disk. Because of this, it probably vastly overestimated the number of pages that would be read. Let me try a concrete example: Zip | City | State | County | Street ------+---------------+----------+---------------+------------- 99501 } Anchorage } AK | Anchorage | 1st st. [...] 94105 | San Francisco | CA | San Francisco | 1st St 94105 | San Francisco | CA | San Francisco | 2nd St [... a few more disk pages for 94105 ...] [... tens more disk pages for San Francisco ...] [... thousands more disk pages for CA ...] 94501 | Alameda | CA | Alameda | 1st St 94501 | Alameda | CA | Alameda | 2nd St [...] 02101 | Boston | MA | Suffolk | 1st St. Note, that all the data for any geographical region (zip, or city, or county) is located close together on disk. If I do a query by City, or by State, or by Zip, I should probably do an index scan. But since the correlation statistic only looks the total ordering; if we order the table so the correlation for one column, it's statistic will look very good; but since the columns have a different sort order the correlation statistic for the others will be very poor. In my copy of the TIGER data (loaded in order of TIGER-file-number (which is ordered by state, and then by county)), you can see the various relevant correlation values in my database. fl=# select attname,correlation from pg_stats where tablename = 'tgr_rt1'; attname | correlation -----------+-------------- tigerfile | 1 rt | 1 version | 0.968349 tlid | 0.151139 [...] zipl | 0.381979 zipr | 0.376332 [...] statel | 0.998373 stater | 0.99855 countyl | 0.111207 countyr | 0.115345 cousubl | -0.0450375 cousubr | -0.0486589 [...] placel | 0.304117 placer | 0.306714 tractl | 0.141538 tractr | 0.134357 blockl | 0.00599286 blockr | -8.73298e-05 frlong | 0.0857337 frlat | 0.174396 tolong | 0.0857399 tolat | 0.174434 (45 rows) Note, that even though the TIGER data is sorted by State/County, "countyl" and "countyr" have some of the worst correlations (in the stats table); because of the way the FIPS codes work. Every state re-cycles the county codes starting with 1 and going up. STATE FIPS CODE | County FIPS code ------------------+----------------- 06 (california) | 001 (alameda) 06 (california) | 003 (alpine) ... 25 (massachusets) | 001 (Barnstable) I have a hack for 7.4 that sets the numbers hidden in pg_statistic used for correlation to 0.75 for these columns; and the planner started making more reasonable guesses. In the end, though, I just resorted to uglier hacks that make the planner favor index scans like setting random_page_cost artificially low. What I think I'd like to see is for there to be another statistic similar to "correlation" but rather than looking at the total-ordering of the table, to look how correlated values within any single page are. If someone pointed me in the right direction, I might try doing this. Ron PS: I think lots of other data has the same issues. A very large name database ordered by "lastname,firstname" will have all people of a given "firstname" on a relatively small set of pages, but the current correlation value wouldn't see that. Firstname | Lastname | Middlename Adam | Brown | Albert Adam | Brown | Alex Adam | Brown | Bob Bill } Brown | .... ... Adam | Smith | Albert Adam | Smith | Alex Adam | Smith | Bob ... Tom Lane wrote: > pgsql@mohawksoft.com writes: > >> -> Index Scan using rt1_zipr, rt1_zipl on rt1 (cost=0.00..121893.93 >>rows=30835 width=302) >> Index Cond: ((zipr = 2186) OR (zipl = 2186)) > > >> zipl | 925 | >> zipr | 899 | > > > Those n_distinct values for zipl and zipr seem aberrant --- too low > compared to the estimated rowcount you're showing. What are the > true figures? Also, how about some EXPLAIN ANALYZEs and not just > EXPLAINs? > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 8: explain analyze is your friend > |
| |||
| * pgsql@mohawksoft.com (pgsql@mohawksoft.com) wrote: > I have the USA census TIGER database loaded, the WHOLE THING, the whole > country. It isn't the biggest database, but it is about 40G before > indexes. Every table is over a few million rows. I can quite honestly say, > a sequential scan is almost never the right thing to do. Cool, I've got it loaded here too w/ PostGIS. It's good stuff. curious what all you're doing with it and especially if you're using mapserver on it or have found a way to identify roads at a more street/city/state/country level, I've had problems doing that. > I suspect that analyze only samples a very small amount of the database > and gets the wrong idea about it. Is there a way to force analyze to > sample more rows? Yup: alter table <blah> alter column <blah> set statistics <blah> As I recall, the default is 100 for statistics, I'd say increase that number to like 200 or 300 or more and see if it helps. Stephen -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.5 (GNU/Linux) iD8DBQFCA/XerzgMPqB3kigRArIfAJ45RKUQXVbu2u9cXvKITTkED2TbsACf T/dD 2uBgOqBCN9FGbCOtUVBK098= =10cT -----END PGP SIGNATURE----- |
| |||
| Ron Mayer <rm_pg@cheapcomplexdevices.com> writes: > What I think I'd like to see is for there to be > another statistic similar to "correlation" but rather > than looking at the total-ordering of the table, to > look how correlated values within any single page are. Our current approach to correlation is surely far too simplistic; it was just a quick hack to get *something* in there for this consideration. I don't personally know enough about statistics to do much better :-( regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| pgsql@mohawksoft.com writes: > One of the things that is disturbing to me about the analyze settings is > that it wants to sample the same number of records from a table regardless > of the size of that table. The papers that I looked at say that this rule has a good solid statistical foundation, at least for the case of estimating histograms. See the references cited in analyze.c. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 7: don't forget to increase your free space map settings |
| ||||
| > pgsql@mohawksoft.com writes: >> One of the things that is disturbing to me about the analyze settings is >> that it wants to sample the same number of records from a table >> regardless >> of the size of that table. > > The papers that I looked at say that this rule has a good solid > statistical foundation, at least for the case of estimating histograms. > See the references cited in analyze.c. Any and all random sampling assumes a degree of uniform distribution. This is the basis of the model. It assumes that chunks of the whole will be representative of the whole (to some degree). This works when normal variations are more or less distributed uniformly. As variations and trends becomes less uniformly distributed, more samples are required to characterize it. Douglas Adams had a great device called the "Total Perspective Vortex" which infered the whole of the universe from a piece of fairy cake. It was a subtle play on the absurd notion that a very small sample could lead to an understanding of an infinitly larger whole. On a very basic level, why bother sampling the whole table at all? Why not check one block and infer all information from that? Because we know that isn't enough data. In a table of 4.6 million rows, can you say with any mathmatical certainty that a sample of 100 points can be, in any way, representative? Another problem with random sampling is trend analysis. Often times there are minor trends in data. Ron pointed out the lastname firstname trend. Although there seems to be no correlation between firstnames in the table, there are clearly groups or clusters of ordered data that is an ordering that is missed by too small a sample. I understand why you chose the Vitter algorithm, because it provides a basically sound methodology for sampling without knowledge of the size of the whole, but I think we can do better. I would suggest using the current algorithm the first time through, then adjust the number of samples [n] based on the previous estimate of the size of the table [N]. Each successive ANALYZE will become more accurate. The Vitter algorithm is still useful as [N] will always be an estimate. ---------------------------(end of broadcast)--------------------------- TIP 8: explain analyze is your friend |