Unix Technical Forum

Thousands of tables versus on table?

This is a discussion on Thousands of tables versus on table? within the Pgsql Performance forums, part of the PostgreSQL category; --> david@lang.hm wrote: > On Wed, 6 Jun 2007, Steinar H. Gunderson wrote: > >> On Tue, Jun 05, 2007 ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql Performance

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #21 (permalink)  
Old 04-19-2008, 10:56 AM
Heikki Linnakangas
 
Posts: n/a
Default Re: Thousands of tables versus on table?

david@lang.hm wrote:
> On Wed, 6 Jun 2007, Steinar H. Gunderson wrote:
>
>> On Tue, Jun 05, 2007 at 05:59:25PM -0400, Tom Lane wrote:
>>> I think the main argument for partitioning is when you are interested in
>>> being able to drop whole partitions cheaply.

>>
>> Wasn't there also talk about adding the ability to mark individual
>> partitions
>> as read-only, thus bypassing MVCC and allowing queries to be satisfied
>> using
>> indexes only?
>>
>> Not that I think I've seen it on the TODO... :-)

>
> now that's a very interesting idea, especially when combined with
> time-based data where the old times will never change.


That's been discussed, but it's controversial. IMHO a better way to
achieve that is to design the dead-space-map so that it can be used to
check which parts of a table are visible to everyone, and skip
visibility checks. That doesn't require any user action, and allows updates.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #22 (permalink)  
Old 04-19-2008, 10:56 AM
Scott Marlowe
 
Posts: n/a
Default Re: Thousands of tables versus on table?

david@lang.hm wrote:
> On Tue, 5 Jun 2007, Tom Lane wrote:
>
>> david@lang.hm writes:
>>> however I really don't understand why it is more efficiant to have a 5B
>>> line table that you do a report/query against 0.1% of then it is to
>>> have
>>> 1000 different tables of 5M lines each and do a report/query against
>>> 100%
>>> of.

>>
>> Essentially what you are doing when you do that is taking the top few
>> levels of the index out of the database and putting it into the
>> filesystem; plus creating duplicative indexing information in the
>> database's system catalogs.
>>
>> The degree to which this is a win is *highly* debatable, and certainly
>> depends on a whole lot of assumptions about filesystem performance.
>> You also need to assume that constraint-exclusion in the planner is
>> pretty doggone cheap relative to the table searches, which means it
>> almost certainly will lose badly if you carry the subdivision out to
>> the extent that the individual tables become small. (This last could

> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> what is considered 'small'? a few thousand records, a few million
> records?


I would say small is when the individual tables are in the 10 to 20
Megabyte range. How many records that is depends on record width, of
course. Basically, once the tables get small enough that you don't
really need indexes much, since you tend to grab 25% or more of each one
that you're going to hit in a query.

> what multiplication factor would there need to be on the partitioning
> to make it worth while? 100 tables, 1000 tables, 10000 tables?

Really depends on the size of the master table I think. If the master
table is about 500 Megs in size, and you partition it down to about 1
meg per child table, you're probably ok. Walking through 500 entries
for constraint exclusion seems pretty speedy from the tests I've run on
a 12M row table that was about 250 Megs, split into 200 to 400 or so
equisized child tables. The time to retrieve 85,000 rows that were all
neighbors went from 2 to 6 seconds, to about 0.2 seconds, and we got rid
of indexes entirely since they weren't really needed anymore.

> the company that I'm at started out with a seperate database per
> customer (not useing postgres), there are basicly zero cross-customer
> queries, with a large volume of updates and lookups.
>
> overall things have now grown to millions of updates/day (some
> multiple of this in lookups), and ~2000 customers, with tens of
> millions of rows between them.
>
> having each one as a seperate database has really helped us over the
> years as it's made it easy to scale (run 500 databases on each server
> instead of 1000, performance just doubled)

I think that for what you're doing, partitioning at the database level
is probably a pretty good compromise solution. Like you say, it's easy
to put busy databases on a new server to balance out the load. Hardware
is cheap.

> various people (not database experts) are pushing to install Oracle
> cluster so that they can move all of these to one table with a
> customerID column.

Have these people identified a particular problem they're trying to
solve, or is this a religious issue for them? From your description it
sounds like a matter of dogma, not problem solving.
> the database folks won't comment much on this either way, but they
> don't seem enthusiastic to combine all the data togeather.

I think they can see the fecal matter heading towards the rotational
cooling device on this one. I can't imagine this being a win from the
perspective of saving the company money.


---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #23 (permalink)  
Old 04-19-2008, 10:56 AM
Scott Marlowe
 
Posts: n/a
Default Re: Thousands of tables versus on table?

Tom Lane wrote:
> The degree to which this is a win is *highly* debatable, and certainly
> depends on a whole lot of assumptions about filesystem performance.
> You also need to assume that constraint-exclusion in the planner is
> pretty doggone cheap relative to the table searches, which means it
> almost certainly will lose badly if you carry the subdivision out to
> the extent that the individual tables become small. (This last could
> be improved in some cases if we had a more explicit representation of
> partitioning, but it'll never be as cheap as one more level of index
> search.)

I did some testing a while back on some of this, and with 400 or so
partitions, the select time was still very fast.

We were testing grabbing 50-80k rows from 12M at a time, all adjacent to
each other. With the one big table and one big two way index method, we
were getting linearly increasing select times as the dataset grew larger
and larger. The indexes were much larger than available memory and
shared buffers. The retrieval time for 50-80k rows was on the order of
2 to 6 seconds, while the retrieval time for the same number of rows
with 400 partitions was about 0.2 to 0.5 seconds.

I haven't tested with more partitions than that, but might if I get a
chance. What was really slow was the inserts since I was using rules at
the time. I'd like to try re-writing it to use triggers, since I would
then have one trigger on the parent table instead of 400 rules. Or I
could imbed the rules into the app that was creating / inserting the
data. The insert performance dropped off VERY fast as I went over 100
rules, and that was what primarily stopped me from testing larger
numbers of partitions.

The select performance stayed very fast with more partitions, so I'm
guessing that the constraint exclusion is pretty well optimized.

I'll play with it some more when I get a chance. For certain operations
like the one we were testing, partitioning seems to pay off big time.

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #24 (permalink)  
Old 04-19-2008, 10:56 AM
Craig James
 
Posts: n/a
Default Re: Thousands of tables versus on table?

david@lang.hm wrote:
> various people (not database experts) are pushing to install Oracle
> cluster so that they can move all of these to one table with a
> customerID column.


They're blowing smoke if they think Oracle can do this. One of my applications had this exact same problem -- table-per-customer versus big-table-for-everyone. Oracle fell over dead, even with the best indexing possible, tuned by the experts, and using partitions keyed to the customerID.

We ended up breaking it up into table-per-customer because Oracle fell over dead when we had to do a big update on a customer's entire dataset. All other operations were slowed by the additional index on the customer-ID, especially complex joins. With a table-for-everyone, you're forced to create tricky partitioning or clustering, clever indexes, and even with that, big updates are problematic. And once you do this, then you become heavily tied to one RDBMS and your applications are no longer portable, because clustering, indexing, partitioning and other DB tuning tricks are very specific to each RDBMS.

When we moved to Postgres, we never revisited this issue, because both Oracle and Postgres are able to handle thousands of tables well. As I wrote in a previous message on a different topic, often the design of your application is more important than the performance. In our case, the table-per-customer makes the applications simpler, and security is MUCH easier.

Oracle is simply not better than Postgres in this regard. As far as I know, there is only one specific situation (discussed frequently here) where Oracle is faster: the count(), min() and max() functions, and I know significant progress has been made since I started using Postgres. I have not found any other query where Oracle is significantly better, and I've found several where Postgres is the clear winner.

It's telling that Oracle's license contract prohibits you from publishing comparisons and benchmarks. You have to wonder why.

Craig

---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #25 (permalink)  
Old 04-19-2008, 10:56 AM
Scott Marlowe
 
Posts: n/a
Default Re: Thousands of tables versus on table?

Craig James wrote:
>
> Oracle is simply not better than Postgres in this regard. As far as I
> know, there is only one specific situation (discussed frequently here)
> where Oracle is faster: the count(), min() and max() functions, and I
> know significant progress has been made since I started using
> Postgres. I have not found any other query where Oracle is
> significantly better, and I've found several where Postgres is the
> clear winner.

In my testing between a commercial database that cannot be named and
postgresql, I found max() / min() to be basically the same, even with
where clauses and joins happening.

count(*), OTOH, is a still a clear winner for the big commercial
database. With smaller sets (1 Million or so) both dbs are in the same
ballpark.

With 30+million rows, count(*) took 2 minutes on pgsql and 4 seconds on
the big database.

OTOH, there are some things, like importing data, which are MUCH faster
in pgsql than in the big database.

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #26 (permalink)  
Old 04-19-2008, 10:56 AM
Craig James
 
Posts: n/a
Default Re: Thousands of tables versus on table?

Scott Marlowe wrote:
> OTOH, there are some things, like importing data, which are MUCH faster
> in pgsql than in the big database.


An excellent point, I forgot about this. The COPY command is the best thing since the invention of a shirt pocket. We have a database-per-customer design, and one of the mosterous advantages of Postgres is that we can easily do backups. A pg_dump, then scp to a backup server, and in just a minute or two we have a full backup. For recovery, pg_restore is equally fast and amazing. Last time I checked, Oracle didn't have anything close to this.

Craig



---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #27 (permalink)  
Old 04-19-2008, 10:56 AM
Jonah H. Harris
 
Posts: n/a
Default Re: Thousands of tables versus on table?

On 6/6/07, Craig James <craig_james@emolecules.com> wrote:
> They're blowing smoke if they think Oracle can do this.


Oracle could handle this fine.

> Oracle fell over dead, even with the best indexing possible,
> tuned by the experts, and using partitions keyed to the
> customerID.


I don't think so, whoever tuned this likely didn't know what they were doing.

> It's telling that Oracle's license contract prohibits you from
> publishing comparisons and benchmarks. You have to wonder why.


They did this for the same reason as everyone else. They don't want
non-experts tuning the database incorrectly, writing a benchmark paper
about it, and making the software look bad.

--
Jonah H. Harris, Software Architect | phone: 732.331.1324
EnterpriseDB Corporation | fax: 732.331.1301
33 Wood Ave S, 3rd Floor | jharris@enterprisedb.com
Iselin, New Jersey 08830 | http://www.enterprisedb.com/

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #28 (permalink)  
Old 04-19-2008, 10:56 AM
Jonah H. Harris
 
Posts: n/a
Default Re: Thousands of tables versus on table?

On 6/6/07, Craig James <craig_james@emolecules.com> wrote:
> Last time I checked, Oracle didn't have anything close to this.


When did you check, 15 years ago? Oracle has direct-path
import/export and data pump; both of which make generic COPY look like
a turtle. The new PostgreSQL bulk-loader takes similar concepts from
Oracle and is fairly faster than COPY.

Don't get me wrong, I'm pro-PostgreSQL... but spouting personal
observations on other databases as facts just boasts an
PostgreSQL-centric egotistical view of the world. If you don't tune
Oracle, it will suck. If you don't understand Oracle architecture
when you tune an application, it will suck; just like PostgreSQL.
People who don't have extensive experience in the other databases just
hear what you say and regurgitate it as fact; which it is not.

Look at how many people in these lists still go on and on about MySQL
flaws based on their experience with MySQL 3.23. Times change and it
doesn't do anyone any good to be ignorant of other databases. If
you're going to speak about another database in a comparison, please
stay current or specify the database you're comparing against.

This is nothing against you, but it always starts an avalanche of,
"look how perfect we are compared to everyone else."

--
Jonah H. Harris, Software Architect | phone: 732.331.1324
EnterpriseDB Corporation | fax: 732.331.1301
33 Wood Ave S, 3rd Floor | jharris@enterprisedb.com
Iselin, New Jersey 08830 | http://www.enterprisedb.com/

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #29 (permalink)  
Old 04-19-2008, 10:56 AM
Andrew Sullivan
 
Posts: n/a
Default Re: Thousands of tables versus on table?

On Wed, Jun 06, 2007 at 12:06:09AM +0200, Steinar H. Gunderson wrote:

> Wasn't there also talk about adding the ability to mark individual
> partitions as read-only, thus bypassing MVCC and allowing queries
> to be satisfied using indexes only?


I have a (different) problem that read-only data segments (maybe
partitions, maybe something else) would help, so I know for sure that
someone is working on a problem like this, but I don't think it's the
sort of thing that's going to come any time soon.

A

--
Andrew Sullivan | ajs@crankycanuck.ca
I remember when computers were frustrating because they *did* exactly what
you told them to. That actually seems sort of quaint now.
--J.D. Baldwin

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #30 (permalink)  
Old 04-19-2008, 10:56 AM
Andrew Sullivan
 
Posts: n/a
Default Re: Thousands of tables versus on table?

On Tue, Jun 05, 2007 at 03:31:55PM -0700, david@lang.hm wrote:
> various people (not database experts) are pushing to install Oracle
> cluster so that they can move all of these to one table with a customerID
> column.


Well, you will always have to deal with the sort of people who will
base their technical prescriptions on the shiny ads they read in
SuperGlobalNetworkedExecutiveGoFast, or whatever rag they're reading
these days. I usually encourage such people actually to perform the
analysis of the license, salary, contingency, and migrations costs
(and do a similar analysis myself, actually, so when they have
overlooked the 30 things that individually cost $1million a piece, I
can point them out). More than one jaw has had to be picked up off
the floor when presented with the bill for RAC. Frequently, people
discover that it is a good way to turn your tidy money-making
enterprise into a giant money hole that produces a sucking sound on
the other end of which is Oracle Corporation.

All of that aside, I have pretty severe doubts that RAC would be a
win for you. A big honkin' single database in Postgres ought to be
able to do this too, if you throw enough hardware money at it. But
it seems a waste to re-implement something that's already apparently
working for you in favour of something more expensive that you don't
seem to need.

A

--
Andrew Sullivan | ajs@crankycanuck.ca
When my information changes, I alter my conclusions. What do you do sir?
--attr. John Maynard Keynes

---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 05:25 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Search Engine Optimization by vBSEO 3.2.0
www.UnixAdminTalk.com