Unix Technical Forum

Thousands of tables versus on table?

This is a discussion on Thousands of tables versus on table? within the Pgsql Performance forums, part of the PostgreSQL category; --> On Mon, 4 Jun 2007, Scott Marlowe wrote: > Gregory Stark wrote: >> "Thomas Andrews" <tandrews@soliantconsulting.com> writes: >> >> ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql Performance

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #11 (permalink)  
Old 04-19-2008, 10:55 AM
david@lang.hm
 
Posts: n/a
Default Re: Thousands of tables versus on table?

On Mon, 4 Jun 2007, Scott Marlowe wrote:

> Gregory Stark wrote:
>> "Thomas Andrews" <tandrews@soliantconsulting.com> writes:
>>
>>
>> > I guess my real question is, does it ever make sense to create thousands
>> > of
>> > tables like this?
>> >

>>
>> Sometimes. But usually it's not a good idea.
>>
>> What you're proposing is basically partitioning, though you may not
>> actually
>> need to put all the partitions together for your purposes. Partitioning's
>> main
>> benefit is in the management of the data. You can drop and load partitions
>> in
>> chunks rather than have to perform large operations on millions of
>> records.
>>
>> Postgres doesn't really get any faster by breaking the tables up like
>> that. In
>> fact it probably gets slower as it has to look up which of the thousands
>> of
>> tables you want to work with.
>>

>
> That's not entirely true. PostgreSQL can be markedly faster using
> partitioning as long as you always access it by referencing the partitioning
> key in the where clause. So, if you partition the table by date, and always
> reference it with a date in the where clause, it will usually be noticeably
> faster. OTOH, if you access it without using a where clause that lets it
> pick partitions, then it will be slower than one big table.
>
> So, while this poster might originally think to have one table for each user,
> resulting in thousands of tables, maybe a compromise where you partition on
> userid ranges would work out well, and keep each partition table down to some
> 50-100 thousand rows, with smaller indexes to match.
>


what if he doesn't use the postgres internal partitioning, but instead
makes his code access the tables named responsesNNNNN where NNNNN is the
id of the customer?

this is what it sounded like he was asking initially.

David Lang

---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #12 (permalink)  
Old 04-19-2008, 10:56 AM
Scott Marlowe
 
Posts: n/a
Default Re: Thousands of tables versus on table?

david@lang.hm wrote:
> On Mon, 4 Jun 2007, Scott Marlowe wrote:
>
>> Gregory Stark wrote:
>>> "Thomas Andrews" <tandrews@soliantconsulting.com> writes:
>>>
>>>
>>> > I guess my real question is, does it ever make sense to create
>>> thousands > of
>>> > tables like this?
>>> >
>>> Sometimes. But usually it's not a good idea.
>>> What you're proposing is basically partitioning, though you may not
>>> actually
>>> need to put all the partitions together for your purposes.
>>> Partitioning's
>>> main
>>> benefit is in the management of the data. You can drop and load
>>> partitions
>>> in
>>> chunks rather than have to perform large operations on millions of
>>> records.
>>>
>>> Postgres doesn't really get any faster by breaking the tables up like
>>> that. In
>>> fact it probably gets slower as it has to look up which of the
>>> thousands
>>> of
>>> tables you want to work with.
>>>

>>
>> That's not entirely true. PostgreSQL can be markedly faster using
>> partitioning as long as you always access it by referencing the
>> partitioning key in the where clause. So, if you partition the table
>> by date, and always reference it with a date in the where clause, it
>> will usually be noticeably faster. OTOH, if you access it without
>> using a where clause that lets it pick partitions, then it will be
>> slower than one big table.
>>
>> So, while this poster might originally think to have one table for
>> each user, resulting in thousands of tables, maybe a compromise where
>> you partition on userid ranges would work out well, and keep each
>> partition table down to some 50-100 thousand rows, with smaller
>> indexes to match.
>>

>
> what if he doesn't use the postgres internal partitioning, but instead
> makes his code access the tables named responsesNNNNN where NNNNN is
> the id of the customer?
>
> this is what it sounded like he was asking initially.


Sorry, I think I initially read your response as "Postgres doesn't
really get any faster by breaking the tables up" without the "like that"
part.

I've found that as long as the number of tables is > 10,000 or so,
having a lot of tables doesn't seem to really slow pgsql down a lot.
I'm sure that the tipping point is dependent on your db machine. I
would bet that if he's referring to individual tables directly, and each
one has hundreds instead of millions of rows, the performance would be
better. But the only way to be sure is to test it.

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #13 (permalink)  
Old 04-19-2008, 10:56 AM
Thomas Andrews
 
Posts: n/a
Default Re: Thousands of tables versus on table?

So, partitioning in PSQL 8 is workable, but breaking up the table up into
actual separate tables is not?

Another solution we have proposed is having 'active' and 'completed' tables.
So, rather than thousands, we'd have four tables:

responders_active
responders_completed
responses_active
responses_completed

That way, the number of responses_active records would not be as huge. The
problem, as we see it, is that the responders are entering their responses
and it is taking too long. But if we separate out active and completed
surveys, then the inserts will likely cost less. We might even be able to
reduce the indices on the _active tables because survey administrators would
not want to run as many complex reports on the active responses.

There would be an extra cost, when the survey is completed, of copying the
records from the '_active' table to the '_completed' table and then deleting
them, but that operation is something a survey administrator would be
willing to accept as taking a while (as well as something we could put off
to an off hour, although we have lots of international customers so it's not
clear when our off hours are.)

=thomas


On 6/5/07 12:48 PM, "Scott Marlowe" <smarlowe@g2switchworks.com> wrote:

> david@lang.hm wrote:
>> On Mon, 4 Jun 2007, Scott Marlowe wrote:
>>
>>> Gregory Stark wrote:
>>>> "Thomas Andrews" <tandrews@soliantconsulting.com> writes:
>>>>
>>>>
>>>>> I guess my real question is, does it ever make sense to create
>>>> thousands > of
>>>>> tables like this?
>>>>>
>>>> Sometimes. But usually it's not a good idea.
>>>> What you're proposing is basically partitioning, though you may not
>>>> actually
>>>> need to put all the partitions together for your purposes.
>>>> Partitioning's
>>>> main
>>>> benefit is in the management of the data. You can drop and load
>>>> partitions
>>>> in
>>>> chunks rather than have to perform large operations on millions of
>>>> records.
>>>>
>>>> Postgres doesn't really get any faster by breaking the tables up like
>>>> that. In
>>>> fact it probably gets slower as it has to look up which of the
>>>> thousands
>>>> of
>>>> tables you want to work with.
>>>>
>>>
>>> That's not entirely true. PostgreSQL can be markedly faster using
>>> partitioning as long as you always access it by referencing the
>>> partitioning key in the where clause. So, if you partition the table
>>> by date, and always reference it with a date in the where clause, it
>>> will usually be noticeably faster. OTOH, if you access it without
>>> using a where clause that lets it pick partitions, then it will be
>>> slower than one big table.
>>>
>>> So, while this poster might originally think to have one table for
>>> each user, resulting in thousands of tables, maybe a compromise where
>>> you partition on userid ranges would work out well, and keep each
>>> partition table down to some 50-100 thousand rows, with smaller
>>> indexes to match.
>>>

>>
>> what if he doesn't use the postgres internal partitioning, but instead
>> makes his code access the tables named responsesNNNNN where NNNNN is
>> the id of the customer?
>>
>> this is what it sounded like he was asking initially.

>
> Sorry, I think I initially read your response as "Postgres doesn't
> really get any faster by breaking the tables up" without the "like that"
> part.
>
> I've found that as long as the number of tables is > 10,000 or so,
> having a lot of tables doesn't seem to really slow pgsql down a lot.
> I'm sure that the tipping point is dependent on your db machine. I
> would bet that if he's referring to individual tables directly, and each
> one has hundreds instead of millions of rows, the performance would be
> better. But the only way to be sure is to test it.



---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #14 (permalink)  
Old 04-19-2008, 10:56 AM
Scott Marlowe
 
Posts: n/a
Default Re: Thousands of tables versus on table?

Thomas Andrews wrote:
>
>
> On 6/5/07 12:48 PM, "Scott Marlowe" <smarlowe@g2switchworks.com> wrote:
>
>
>> david@lang.hm wrote:
>>
>>> On Mon, 4 Jun 2007, Scott Marlowe wrote:
>>>
>>>
>>>> Gregory Stark wrote:
>>>>
>>>>> "Thomas Andrews" <tandrews@soliantconsulting.com> writes:
>>>>>
>>>>>
>>>>>
>>>>>> I guess my real question is, does it ever make sense to create
>>>>>>
>>>>> thousands > of
>>>>>
>>>>>> tables like this?
>>>>>>
>>>>>>
>>>>> Sometimes. But usually it's not a good idea.
>>>>> What you're proposing is basically partitioning, though you may not
>>>>> actually
>>>>> need to put all the partitions together for your purposes.
>>>>> Partitioning's
>>>>> main
>>>>> benefit is in the management of the data. You can drop and load
>>>>> partitions
>>>>> in
>>>>> chunks rather than have to perform large operations on millions of
>>>>> records.
>>>>>
>>>>> Postgres doesn't really get any faster by breaking the tables up like
>>>>> that. In
>>>>> fact it probably gets slower as it has to look up which of the
>>>>> thousands
>>>>> of
>>>>> tables you want to work with.
>>>>>
>>>>>
>>>> That's not entirely true. PostgreSQL can be markedly faster using
>>>> partitioning as long as you always access it by referencing the
>>>> partitioning key in the where clause. So, if you partition the table
>>>> by date, and always reference it with a date in the where clause, it
>>>> will usually be noticeably faster. OTOH, if you access it without
>>>> using a where clause that lets it pick partitions, then it will be
>>>> slower than one big table.
>>>>
>>>> So, while this poster might originally think to have one table for
>>>> each user, resulting in thousands of tables, maybe a compromise where
>>>> you partition on userid ranges would work out well, and keep each
>>>> partition table down to some 50-100 thousand rows, with smaller
>>>> indexes to match.
>>>>
>>>>
>>> what if he doesn't use the postgres internal partitioning, but instead
>>> makes his code access the tables named responsesNNNNN where NNNNN is
>>> the id of the customer?
>>>
>>> this is what it sounded like he was asking initially.
>>>

>> Sorry, I think I initially read your response as "Postgres doesn't
>> really get any faster by breaking the tables up" without the "like that"
>> part.
>>
>> I've found that as long as the number of tables is > 10,000 or so,
>>

That should have been as long as the number of tables is < 10,000 or so...

>> having a lot of tables doesn't seem to really slow pgsql down a lot.
>> I'm sure that the tipping point is dependent on your db machine. I
>> would bet that if he's referring to individual tables directly, and each
>> one has hundreds instead of millions of rows, the performance would be
>> better. But the only way to be sure is to test it.
>>

>
>

Please stop top posting. This is my last reply until you stop top posting.

> So, partitioning in PSQL 8 is workable, but breaking up the table up into
> actual separate tables is not?
>

Ummm, that's not what I said. They're similar in execution. However,
partitioning might let you put 100 customers into a given table, if,
say, you partitioned on customer ID or something that would allow you to
group a few together.
> Another solution we have proposed is having 'active' and 'completed' tables.
> So, rather than thousands, we'd have four tables:
>
> responders_active
> responders_completed
> responses_active
> responses_completed
>

That's not a bad idea. Just keep up on your vacuuming.

---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #15 (permalink)  
Old 04-19-2008, 10:56 AM
Scott Marlowe
 
Posts: n/a
Default Re: Thousands of tables versus on table?

Gregory Stark wrote:
> "Scott Marlowe" <smarlowe@g2switchworks.com> writes:
>
>
>> Sorry, I think I initially read your response as "Postgres doesn't really get
>> any faster by breaking the tables up" without the "like that" part.
>>

>
> Well breaking up the tables like that or partitioning, either way should be
> about equivalent really. Breaking up the tables and doing it in the
> application should perform even better but it does make the schema less
> flexible and harder to do non-partition based queries and so on.
>

True, but we can break it up by something other than the company name on
the survey, in this instance, and might find it far easier to manage by,
say, date range, company ID range, etc...
Plus with a few hand rolled bash or perl scripts we can maintain our
database and keep all the logic of partitioning out of our app. Which
would allow developers not wholly conversant in our partitioning scheme
to participate in development without the fear of them putting data in
the wrong place.
> Where the win in partitioning comes in is in being able to disappear some of
> the data entirely. By making part of the index key implicit in the choice of
> partition you get away with a key that's half as large. And in some cases you
> can get away with using a different key entirely which wouldn't otherwise have
> been feasible to index. In some cases you can even do sequential scans whereas
> in an unpartitioned table you would have to use an index (or scan the entire
> table).
>

Yeah, I found that out recently while I benchmarking a 12,000,000 row
geometric data set. Breaking it into 400 or so partitions resulted in
no need for indexes and response times of 0.2 or so seconds, where
before that I'd been in the 1.5 to 3 second range.

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #16 (permalink)  
Old 04-19-2008, 10:56 AM
david@lang.hm
 
Posts: n/a
Default Re: Thousands of tables versus on table?

On Tue, 5 Jun 2007, Gregory Stark wrote:

> "Scott Marlowe" <smarlowe@g2switchworks.com> writes:
>
>> Sorry, I think I initially read your response as "Postgres doesn't really get
>> any faster by breaking the tables up" without the "like that" part.

>
> Well breaking up the tables like that or partitioning, either way should be
> about equivalent really. Breaking up the tables and doing it in the
> application should perform even better but it does make the schema less
> flexible and harder to do non-partition based queries and so on.


but he said in the initial message that they don't do cross-customer
reports anyway, so there really isn't any non-partition based querying
going on anyway.

> I guess I should explain what I originally meant: A lot of people come from a
> flat-file world and assume that things get slower when you deal with large
> tables. In fact due to the magic of log(n) accessing records from a large
> index is faster than first looking up the table and index info in a small
> index and then doing a second lookup in up in an index for a table half the
> size.


however, if your query plan every does a sequential scan of a table then
you are nog doing a log(n) lookup are you?

> Where the win in partitioning comes in is in being able to disappear some of
> the data entirely. By making part of the index key implicit in the choice of
> partition you get away with a key that's half as large. And in some cases you
> can get away with using a different key entirely which wouldn't otherwise have
> been feasible to index. In some cases you can even do sequential scans whereas
> in an unpartitioned table you would have to use an index (or scan the entire
> table).
>
> But the real reason people partition data is really for the management ease.
> Being able to drop, and load entire partitions in O(1) is makes it feasible to
> manage data on a scale that would simply be impossible without partitioned
> tables.


remember that the origional question wasn't about partitioned tables, it
was about the performance problem he was having with one large table (slow
insert speed) and asking if postgres would collapse if he changed his
schema to use a seperate table per customer.

I see many cases where people advocate collapsing databases/tables
togeather by adding a column that indicates which customer the line is
for.

however I really don't understand why it is more efficiant to have a 5B
line table that you do a report/query against 0.1% of then it is to have
1000 different tables of 5M lines each and do a report/query against 100%
of. it would seem that the fact that you don't have to skip over 99.9% of
the data to find things that _may_ be relavent would have a noticable cost
in and of itself.

David Lang

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #17 (permalink)  
Old 04-19-2008, 10:56 AM
Tom Lane
 
Posts: n/a
Default Re: Thousands of tables versus on table?

david@lang.hm writes:
> however I really don't understand why it is more efficiant to have a 5B
> line table that you do a report/query against 0.1% of then it is to have
> 1000 different tables of 5M lines each and do a report/query against 100%
> of.


Essentially what you are doing when you do that is taking the top few
levels of the index out of the database and putting it into the
filesystem; plus creating duplicative indexing information in the
database's system catalogs.

The degree to which this is a win is *highly* debatable, and certainly
depends on a whole lot of assumptions about filesystem performance.
You also need to assume that constraint-exclusion in the planner is
pretty doggone cheap relative to the table searches, which means it
almost certainly will lose badly if you carry the subdivision out to
the extent that the individual tables become small. (This last could
be improved in some cases if we had a more explicit representation of
partitioning, but it'll never be as cheap as one more level of index
search.)

I think the main argument for partitioning is when you are interested in
being able to drop whole partitions cheaply.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #18 (permalink)  
Old 04-19-2008, 10:56 AM
Steinar H. Gunderson
 
Posts: n/a
Default Re: Thousands of tables versus on table?

On Tue, Jun 05, 2007 at 05:59:25PM -0400, Tom Lane wrote:
> I think the main argument for partitioning is when you are interested in
> being able to drop whole partitions cheaply.


Wasn't there also talk about adding the ability to mark individual partitions
as read-only, thus bypassing MVCC and allowing queries to be satisfied using
indexes only?

Not that I think I've seen it on the TODO... :-)

/* Steinar */
--
Homepage: http://www.sesse.net/

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #19 (permalink)  
Old 04-19-2008, 10:56 AM
david@lang.hm
 
Posts: n/a
Default Re: Thousands of tables versus on table?

On Tue, 5 Jun 2007, Tom Lane wrote:

> david@lang.hm writes:
>> however I really don't understand why it is more efficiant to have a 5B
>> line table that you do a report/query against 0.1% of then it is to have
>> 1000 different tables of 5M lines each and do a report/query against 100%
>> of.

>
> Essentially what you are doing when you do that is taking the top few
> levels of the index out of the database and putting it into the
> filesystem; plus creating duplicative indexing information in the
> database's system catalogs.
>
> The degree to which this is a win is *highly* debatable, and certainly
> depends on a whole lot of assumptions about filesystem performance.
> You also need to assume that constraint-exclusion in the planner is
> pretty doggone cheap relative to the table searches, which means it
> almost certainly will lose badly if you carry the subdivision out to
> the extent that the individual tables become small. (This last could

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
what is considered 'small'? a few thousand records, a few million records?

what multiplication factor would there need to be on the partitioning to
make it worth while? 100 tables, 1000 tables, 10000 tables?

the company that I'm at started out with a seperate database per customer
(not useing postgres), there are basicly zero cross-customer queries, with
a large volume of updates and lookups.

overall things have now grown to millions of updates/day (some multiple of
this in lookups), and ~2000 customers, with tens of millions of rows
between them.

having each one as a seperate database has really helped us over the years
as it's made it easy to scale (run 500 databases on each server instead of
1000, performance just doubled)

various people (not database experts) are pushing to install Oracle
cluster so that they can move all of these to one table with a customerID
column.

the database folks won't comment much on this either way, but they don't
seem enthusiastic to combine all the data togeather.

I've been on the side of things that said that seperate databases is
better becouse it improves data locality to only have to look at the data
for one customer at a time rather then having to pick out that customer's
data out from the mass of other, unrelated data.

> be improved in some cases if we had a more explicit representation of
> partitioning, but it'll never be as cheap as one more level of index
> search.)


say you have a billing table of
customerID, date, description, amount, tax, extended, paid

and you need to do things like
report on invoices that haven't been paied
summarize the amount billed each month
summarize the tax for each month

but you need to do this seperately for each customerID (not as a batch job
that reports on all customerID's at once, think a website where the
customer can request such reports at any time with a large variation in
criteria)

would you be able to just have one index on customerID and then another on
date? or would the second one need to be on customerID||date?

and would this process of going throught he index and seeking to the data
it points to really be faster then a sequential scan of just the data
related to that customerID?

> I think the main argument for partitioning is when you are interested in
> being able to drop whole partitions cheaply.


I fully understand this if you are doing queries across all the
partitions, but if your query is confined to a single partition,
especially in the case where you know ahead of time in the application
which 'partition' you care about it would seem that searching through
significantly less data should be a win.

David Lang

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #20 (permalink)  
Old 04-19-2008, 10:56 AM
david@lang.hm
 
Posts: n/a
Default Re: Thousands of tables versus on table?

On Wed, 6 Jun 2007, Steinar H. Gunderson wrote:

> On Tue, Jun 05, 2007 at 05:59:25PM -0400, Tom Lane wrote:
>> I think the main argument for partitioning is when you are interested in
>> being able to drop whole partitions cheaply.

>
> Wasn't there also talk about adding the ability to mark individual partitions
> as read-only, thus bypassing MVCC and allowing queries to be satisfied using
> indexes only?
>
> Not that I think I've seen it on the TODO... :-)


now that's a very interesting idea, especially when combined with
time-based data where the old times will never change.

David Lang

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 04:57 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Search Engine Optimization by vBSEO 3.2.0
www.UnixAdminTalk.com