This is a discussion on Thousands of tables versus on table? within the Pgsql Performance forums, part of the PostgreSQL category; --> david@lang.hm wrote: > On Wed, 6 Jun 2007, Steinar H. Gunderson wrote: > >> On Tue, Jun 05, 2007 ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| david@lang.hm wrote: > On Wed, 6 Jun 2007, Steinar H. Gunderson wrote: > >> On Tue, Jun 05, 2007 at 05:59:25PM -0400, Tom Lane wrote: >>> I think the main argument for partitioning is when you are interested in >>> being able to drop whole partitions cheaply. >> >> Wasn't there also talk about adding the ability to mark individual >> partitions >> as read-only, thus bypassing MVCC and allowing queries to be satisfied >> using >> indexes only? >> >> Not that I think I've seen it on the TODO... :-) > > now that's a very interesting idea, especially when combined with > time-based data where the old times will never change. That's been discussed, but it's controversial. IMHO a better way to achieve that is to design the dead-space-map so that it can be used to check which parts of a table are visible to everyone, and skip visibility checks. That doesn't require any user action, and allows updates. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---------------------------(end of broadcast)--------------------------- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate |
| |||
| david@lang.hm wrote: > On Tue, 5 Jun 2007, Tom Lane wrote: > >> david@lang.hm writes: >>> however I really don't understand why it is more efficiant to have a 5B >>> line table that you do a report/query against 0.1% of then it is to >>> have >>> 1000 different tables of 5M lines each and do a report/query against >>> 100% >>> of. >> >> Essentially what you are doing when you do that is taking the top few >> levels of the index out of the database and putting it into the >> filesystem; plus creating duplicative indexing information in the >> database's system catalogs. >> >> The degree to which this is a win is *highly* debatable, and certainly >> depends on a whole lot of assumptions about filesystem performance. >> You also need to assume that constraint-exclusion in the planner is >> pretty doggone cheap relative to the table searches, which means it >> almost certainly will lose badly if you carry the subdivision out to >> the extent that the individual tables become small. (This last could > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > what is considered 'small'? a few thousand records, a few million > records? I would say small is when the individual tables are in the 10 to 20 Megabyte range. How many records that is depends on record width, of course. Basically, once the tables get small enough that you don't really need indexes much, since you tend to grab 25% or more of each one that you're going to hit in a query. > what multiplication factor would there need to be on the partitioning > to make it worth while? 100 tables, 1000 tables, 10000 tables? Really depends on the size of the master table I think. If the master table is about 500 Megs in size, and you partition it down to about 1 meg per child table, you're probably ok. Walking through 500 entries for constraint exclusion seems pretty speedy from the tests I've run on a 12M row table that was about 250 Megs, split into 200 to 400 or so equisized child tables. The time to retrieve 85,000 rows that were all neighbors went from 2 to 6 seconds, to about 0.2 seconds, and we got rid of indexes entirely since they weren't really needed anymore. > the company that I'm at started out with a seperate database per > customer (not useing postgres), there are basicly zero cross-customer > queries, with a large volume of updates and lookups. > > overall things have now grown to millions of updates/day (some > multiple of this in lookups), and ~2000 customers, with tens of > millions of rows between them. > > having each one as a seperate database has really helped us over the > years as it's made it easy to scale (run 500 databases on each server > instead of 1000, performance just doubled) I think that for what you're doing, partitioning at the database level is probably a pretty good compromise solution. Like you say, it's easy to put busy databases on a new server to balance out the load. Hardware is cheap. > various people (not database experts) are pushing to install Oracle > cluster so that they can move all of these to one table with a > customerID column. Have these people identified a particular problem they're trying to solve, or is this a religious issue for them? From your description it sounds like a matter of dogma, not problem solving. > the database folks won't comment much on this either way, but they > don't seem enthusiastic to combine all the data togeather. I think they can see the fecal matter heading towards the rotational cooling device on this one. I can't imagine this being a win from the perspective of saving the company money. ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| Tom Lane wrote: > The degree to which this is a win is *highly* debatable, and certainly > depends on a whole lot of assumptions about filesystem performance. > You also need to assume that constraint-exclusion in the planner is > pretty doggone cheap relative to the table searches, which means it > almost certainly will lose badly if you carry the subdivision out to > the extent that the individual tables become small. (This last could > be improved in some cases if we had a more explicit representation of > partitioning, but it'll never be as cheap as one more level of index > search.) I did some testing a while back on some of this, and with 400 or so partitions, the select time was still very fast. We were testing grabbing 50-80k rows from 12M at a time, all adjacent to each other. With the one big table and one big two way index method, we were getting linearly increasing select times as the dataset grew larger and larger. The indexes were much larger than available memory and shared buffers. The retrieval time for 50-80k rows was on the order of 2 to 6 seconds, while the retrieval time for the same number of rows with 400 partitions was about 0.2 to 0.5 seconds. I haven't tested with more partitions than that, but might if I get a chance. What was really slow was the inserts since I was using rules at the time. I'd like to try re-writing it to use triggers, since I would then have one trigger on the parent table instead of 400 rules. Or I could imbed the rules into the app that was creating / inserting the data. The insert performance dropped off VERY fast as I went over 100 rules, and that was what primarily stopped me from testing larger numbers of partitions. The select performance stayed very fast with more partitions, so I'm guessing that the constraint exclusion is pretty well optimized. I'll play with it some more when I get a chance. For certain operations like the one we were testing, partitioning seems to pay off big time. ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings |
| |||
| david@lang.hm wrote: > various people (not database experts) are pushing to install Oracle > cluster so that they can move all of these to one table with a > customerID column. They're blowing smoke if they think Oracle can do this. One of my applications had this exact same problem -- table-per-customer versus big-table-for-everyone. Oracle fell over dead, even with the best indexing possible, tuned by the experts, and using partitions keyed to the customerID. We ended up breaking it up into table-per-customer because Oracle fell over dead when we had to do a big update on a customer's entire dataset. All other operations were slowed by the additional index on the customer-ID, especially complex joins. With a table-for-everyone, you're forced to create tricky partitioning or clustering, clever indexes, and even with that, big updates are problematic. And once you do this, then you become heavily tied to one RDBMS and your applications are no longer portable, because clustering, indexing, partitioning and other DB tuning tricks are very specific to each RDBMS. When we moved to Postgres, we never revisited this issue, because both Oracle and Postgres are able to handle thousands of tables well. As I wrote in a previous message on a different topic, often the design of your application is more important than the performance. In our case, the table-per-customer makes the applications simpler, and security is MUCH easier. Oracle is simply not better than Postgres in this regard. As far as I know, there is only one specific situation (discussed frequently here) where Oracle is faster: the count(), min() and max() functions, and I know significant progress has been made since I started using Postgres. I have not found any other query where Oracle is significantly better, and I've found several where Postgres is the clear winner. It's telling that Oracle's license contract prohibits you from publishing comparisons and benchmarks. You have to wonder why. Craig ---------------------------(end of broadcast)--------------------------- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate |
| |||
| Craig James wrote: > > Oracle is simply not better than Postgres in this regard. As far as I > know, there is only one specific situation (discussed frequently here) > where Oracle is faster: the count(), min() and max() functions, and I > know significant progress has been made since I started using > Postgres. I have not found any other query where Oracle is > significantly better, and I've found several where Postgres is the > clear winner. In my testing between a commercial database that cannot be named and postgresql, I found max() / min() to be basically the same, even with where clauses and joins happening. count(*), OTOH, is a still a clear winner for the big commercial database. With smaller sets (1 Million or so) both dbs are in the same ballpark. With 30+million rows, count(*) took 2 minutes on pgsql and 4 seconds on the big database. OTOH, there are some things, like importing data, which are MUCH faster in pgsql than in the big database. ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster |
| |||
| Scott Marlowe wrote: > OTOH, there are some things, like importing data, which are MUCH faster > in pgsql than in the big database. An excellent point, I forgot about this. The COPY command is the best thing since the invention of a shirt pocket. We have a database-per-customer design, and one of the mosterous advantages of Postgres is that we can easily do backups. A pg_dump, then scp to a backup server, and in just a minute or two we have a full backup. For recovery, pg_restore is equally fast and amazing. Last time I checked, Oracle didn't have anything close to this. Craig ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| On 6/6/07, Craig James <craig_james@emolecules.com> wrote: > They're blowing smoke if they think Oracle can do this. Oracle could handle this fine. > Oracle fell over dead, even with the best indexing possible, > tuned by the experts, and using partitions keyed to the > customerID. I don't think so, whoever tuned this likely didn't know what they were doing. > It's telling that Oracle's license contract prohibits you from > publishing comparisons and benchmarks. You have to wonder why. They did this for the same reason as everyone else. They don't want non-experts tuning the database incorrectly, writing a benchmark paper about it, and making the software look bad. -- Jonah H. Harris, Software Architect | phone: 732.331.1324 EnterpriseDB Corporation | fax: 732.331.1301 33 Wood Ave S, 3rd Floor | jharris@enterprisedb.com Iselin, New Jersey 08830 | http://www.enterprisedb.com/ ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings |
| |||
| On 6/6/07, Craig James <craig_james@emolecules.com> wrote: > Last time I checked, Oracle didn't have anything close to this. When did you check, 15 years ago? Oracle has direct-path import/export and data pump; both of which make generic COPY look like a turtle. The new PostgreSQL bulk-loader takes similar concepts from Oracle and is fairly faster than COPY. Don't get me wrong, I'm pro-PostgreSQL... but spouting personal observations on other databases as facts just boasts an PostgreSQL-centric egotistical view of the world. If you don't tune Oracle, it will suck. If you don't understand Oracle architecture when you tune an application, it will suck; just like PostgreSQL. People who don't have extensive experience in the other databases just hear what you say and regurgitate it as fact; which it is not. Look at how many people in these lists still go on and on about MySQL flaws based on their experience with MySQL 3.23. Times change and it doesn't do anyone any good to be ignorant of other databases. If you're going to speak about another database in a comparison, please stay current or specify the database you're comparing against. This is nothing against you, but it always starts an avalanche of, "look how perfect we are compared to everyone else." -- Jonah H. Harris, Software Architect | phone: 732.331.1324 EnterpriseDB Corporation | fax: 732.331.1301 33 Wood Ave S, 3rd Floor | jharris@enterprisedb.com Iselin, New Jersey 08830 | http://www.enterprisedb.com/ ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org |
| |||
| On Wed, Jun 06, 2007 at 12:06:09AM +0200, Steinar H. Gunderson wrote: > Wasn't there also talk about adding the ability to mark individual > partitions as read-only, thus bypassing MVCC and allowing queries > to be satisfied using indexes only? I have a (different) problem that read-only data segments (maybe partitions, maybe something else) would help, so I know for sure that someone is working on a problem like this, but I don't think it's the sort of thing that's going to come any time soon. A -- Andrew Sullivan | ajs@crankycanuck.ca I remember when computers were frustrating because they *did* exactly what you told them to. That actually seems sort of quaint now. --J.D. Baldwin ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster |
| ||||
| On Tue, Jun 05, 2007 at 03:31:55PM -0700, david@lang.hm wrote: > various people (not database experts) are pushing to install Oracle > cluster so that they can move all of these to one table with a customerID > column. Well, you will always have to deal with the sort of people who will base their technical prescriptions on the shiny ads they read in SuperGlobalNetworkedExecutiveGoFast, or whatever rag they're reading these days. I usually encourage such people actually to perform the analysis of the license, salary, contingency, and migrations costs (and do a similar analysis myself, actually, so when they have overlooked the 30 things that individually cost $1million a piece, I can point them out). More than one jaw has had to be picked up off the floor when presented with the bill for RAC. Frequently, people discover that it is a good way to turn your tidy money-making enterprise into a giant money hole that produces a sucking sound on the other end of which is Oracle Corporation. All of that aside, I have pretty severe doubts that RAC would be a win for you. A big honkin' single database in Postgres ought to be able to do this too, if you throw enough hardware money at it. But it seems a waste to re-implement something that's already apparently working for you in favour of something more expensive that you don't seem to need. A -- Andrew Sullivan | ajs@crankycanuck.ca When my information changes, I alter my conclusions. What do you do sir? --attr. John Maynard Keynes ---------------------------(end of broadcast)--------------------------- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate |