Unix Technical Forum

TB-sized databases

This is a discussion on TB-sized databases within the Pgsql Performance forums, part of the PostgreSQL category; --> On Wed, 2007-11-28 at 14:48 +0100, Csaba Nagy wrote: > In fact an even more useful option would be ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql Performance

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #21 (permalink)  
Old 04-19-2008, 11:42 AM
Simon Riggs
 
Posts: n/a
Default Re: TB-sized databases

On Wed, 2007-11-28 at 14:48 +0100, Csaba Nagy wrote:

> In fact an even more useful option would be to ask the planner to throw
> error if the expected cost exceeds a certain threshold...


Well, I've suggested it before:

statement_cost_limit on pgsql-hackers, 1 March 2006

Would people like me to re-write and resubmit this patch for 8.4?

Tom's previous concerns were along the lines of "How would know what to
set it to?", given that the planner costs are mostly arbitrary numbers.

Any bright ideas, or is it we want it and we don't care about the
possible difficulties?

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #22 (permalink)  
Old 04-19-2008, 11:42 AM
Simon Riggs
 
Posts: n/a
Default Re: TB-sized databases

On Tue, 2007-11-27 at 18:06 -0500, Pablo Alcaraz wrote:
> Simon Riggs wrote:
> > All of those responses have cooked up quite a few topics into one. Large
> > databases might mean text warehouses, XML message stores, relational
> > archives and fact-based business data warehouses.
> >
> > The main thing is that TB-sized databases are performance critical. So
> > it all depends upon your workload really as to how well PostgreSQL, or
> > another other RDBMS vendor can handle them.
> >
> >
> > Anyway, my reason for replying to this thread is that I'm planning
> > changes for PostgreSQL 8.4+ that will make allow us to get bigger and
> > faster databases. If anybody has specific concerns then I'd like to hear
> > them so I can consider those things in the planning stages

> it would be nice to do something with selects so we can recover a rowset
> on huge tables using a criteria with indexes without fall running a full
> scan.
>
> In my opinion, by definition, a huge database sooner or later will have
> tables far bigger than RAM available (same for their indexes). I think
> the queries need to be solved using indexes enough smart to be fast on disk.


OK, I agree with this one.

I'd thought that index-only plans were only for OLTP, but now I see they
can also make a big difference with DW queries. So I'm very interested
in this area now.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #23 (permalink)  
Old 04-19-2008, 11:42 AM
Matthew
 
Posts: n/a
Default Re: TB-sized databases

On Wed, 28 Nov 2007, Simon Riggs wrote:
> statement_cost_limit on pgsql-hackers, 1 March 2006
>
> Would people like me to re-write and resubmit this patch for 8.4?


Yes please. The more options, the better.

> Tom's previous concerns were along the lines of "How would know what to
> set it to?", given that the planner costs are mostly arbitrary numbers.
>
> Any bright ideas, or is it we want it and we don't care about the
> possible difficulties?


I think this is something that the average person should just knuckle down
and work out.

At the moment on my work's system, we call EXPLAIN before queries to find
out if it will take too long. This would improve performance by stopping
us having to pass the query into the query planner twice.

Matthew

--
An ant doesn't have a lot of processing power available to it. I'm not trying
to be speciesist - I wouldn't want to detract you from such a wonderful
creature, but, well, there isn't a lot there, is there?
-- Computer Science Lecturer

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #24 (permalink)  
Old 04-19-2008, 11:42 AM
david@lang.hm
 
Posts: n/a
Default Re: TB-sized databases

On Wed, 28 Nov 2007, Simon Riggs wrote:

> On Wed, 2007-11-28 at 14:48 +0100, Csaba Nagy wrote:
>
>> In fact an even more useful option would be to ask the planner to throw
>> error if the expected cost exceeds a certain threshold...

>
> Well, I've suggested it before:
>
> statement_cost_limit on pgsql-hackers, 1 March 2006
>
> Would people like me to re-write and resubmit this patch for 8.4?
>
> Tom's previous concerns were along the lines of "How would know what to
> set it to?", given that the planner costs are mostly arbitrary numbers.


arbitrary numbers are fine if they are relativly consistant with each
other.

will a plan with a estimated cost of 1,000,000 take approximatly 100 times
as long as one with a cost of 10,000?

or more importantly, will a plan with an estimated cost of 2000 reliably
take longer then one with an estimated cost of 1000?

David Lang

> Any bright ideas, or is it we want it and we don't care about the
> possible difficulties?
>
>


---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #25 (permalink)  
Old 04-19-2008, 11:42 AM
Gregory Stark
 
Posts: n/a
Default Re: TB-sized databases

"Simon Riggs" <simon@2ndquadrant.com> writes:

> On Wed, 2007-11-28 at 14:48 +0100, Csaba Nagy wrote:
>
>> In fact an even more useful option would be to ask the planner to throw
>> error if the expected cost exceeds a certain threshold...

>
> Well, I've suggested it before:
>
> statement_cost_limit on pgsql-hackers, 1 March 2006
>
> Would people like me to re-write and resubmit this patch for 8.4?
>
> Tom's previous concerns were along the lines of "How would know what to
> set it to?", given that the planner costs are mostly arbitrary numbers.


Hm, that's only kind of true.

Since 8.mumble seq_page_cost is itself configurable meaning you can adjust the
base unit and calibrate all the parameters to be time in whatever unit you
choose.

But even assuming you haven't so adjusted seq_page_cost and all the other
parameters to match the numbers aren't entirely arbitrary. They represent time
in units of "however long a single sequential page read takes".

Obviously few people know how long such a page read takes but surely you would
just run a few sequential reads of large tables and set the limit to some
multiple of whatever you find.

This isn't going to precise to the level of being able to avoid executing any
query which will take over 1000ms. But it is going to be able to catch
unconstrained cross joins or large sequential scans or such.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #26 (permalink)  
Old 04-19-2008, 11:42 AM
Tom Lane
 
Posts: n/a
Default Re: TB-sized databases

Gregory Stark <stark@enterprisedb.com> writes:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
>> Tom's previous concerns were along the lines of "How would know what to
>> set it to?", given that the planner costs are mostly arbitrary numbers.


> Hm, that's only kind of true.


The units are not the problem. The problem is that you are staking
non-failure of your application on the planner's estimates being
pretty well in line with reality. Not merely in line enough that
it picks a reasonably cheap plan, but in line enough that if it
thinks plan A is 10x more expensive than plan B, then the actual
ratio is indeed somewhere near 10.

Given that this list spends all day every day discussing cases where the
planner is wrong, I'd have to think that that's a bet I wouldn't take.

You could probably avoid this risk by setting the cutoff at something
like 100 or 1000 times what you really want to tolerate, but how
useful is it then?

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #27 (permalink)  
Old 04-19-2008, 11:42 AM
Simon Riggs
 
Posts: n/a
Default Re: TB-sized databases

On Thu, 2007-11-29 at 10:45 -0500, Tom Lane wrote:
> Gregory Stark <stark@enterprisedb.com> writes:
> > "Simon Riggs" <simon@2ndquadrant.com> writes:
> >> Tom's previous concerns were along the lines of "How would know what to
> >> set it to?", given that the planner costs are mostly arbitrary numbers.

>
> > Hm, that's only kind of true.

>
> The units are not the problem. The problem is that you are staking
> non-failure of your application on the planner's estimates being
> pretty well in line with reality. Not merely in line enough that
> it picks a reasonably cheap plan, but in line enough that if it
> thinks plan A is 10x more expensive than plan B, then the actual
> ratio is indeed somewhere near 10.
>
> Given that this list spends all day every day discussing cases where the
> planner is wrong, I'd have to think that that's a bet I wouldn't take.


I think you have a point, but the alternative is often much worse.

If an SQL statement fails because of too high cost, we can investigate
the problem and re-submit. If a website slows down because somebody
allowed a very large query to execute then everybody is affected, not
just the person who ran the bad query. Either way the guy that ran the
query loses, but without constraints in place one guy can kill everybody
else also.

> You could probably avoid this risk by setting the cutoff at something
> like 100 or 1000 times what you really want to tolerate, but how
> useful is it then?


Still fairly useful, as long as we understand its a blunt instrument.

If the whole performance of your system depends upon indexed access then
rogue queries can have disastrous, unpredictable consequences. Many
sites construct their SQL dynamically, so a mistake in a seldom used
code path can allow killer queries through. Even the best DBAs have been
known to make mistakes.

e.g. An 80GB table has 8 million blocks in it.
- So putting a statement_cost limit = 1 million would allow some fairly
large queries but prevent anything that did a SeqScan (or worse).
- Setting it 10 million is going to prevent things like sorting the
whole table without a LIMIT
- Setting it at 100 million is going to prevent unconstrained product
joins etc..

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #28 (permalink)  
Old 04-19-2008, 11:42 AM
Stephen Frost
 
Posts: n/a
Default Re: TB-sized databases

* Simon Riggs (simon@2ndquadrant.com) wrote:
> On Thu, 2007-11-29 at 10:45 -0500, Tom Lane wrote:
> > Given that this list spends all day every day discussing cases where the
> > planner is wrong, I'd have to think that that's a bet I wouldn't take.

>
> I think you have a point, but the alternative is often much worse.


I'm not convinced you've outlined the consequences of implementing a
plan cost limit sufficiently.

> If an SQL statement fails because of too high cost, we can investigate
> the problem and re-submit. If a website slows down because somebody
> allowed a very large query to execute then everybody is affected, not
> just the person who ran the bad query. Either way the guy that ran the
> query loses, but without constraints in place one guy can kill everybody
> else also.


It's entirely possible (likely even) that most of the users accessing a
webpage are using the same queries and the same tables. If the estimates
for those tables ends up changing enough that PG adjusts the plan cost to
be above the plan cost limit then *all* of the users would be affected.

The plan cost isn't going to change for just one user if it's the same
query that a bunch of users are using. I'm not sure if handling the
true 'rougue query' case with this limit would actually be a net
improvment overall in a website-based situation.

I could see it being useful to set a 'notice_on_high_cost_query'
variable where someone working in a data warehouse situation would get a
notice if the query he's hand-crafting has a very high cost (in which
case he could ctrl-c it if he thinks something is wrong, rather than
waiting 5 hours before realizing he forgot a join clause), but the
website with the one rougue query run by one user seems a stretch.

Thanks,

Stephen

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFHTuwNrzgMPqB3kigRAvCTAJ9pOdRQYBqzRM53vjwpXi MOf5yBfwCghM/d
ur95/WD8krLq4sXiaJg+/rc=
=1LMH
-----END PGP SIGNATURE-----

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #29 (permalink)  
Old 04-19-2008, 11:42 AM
Csaba Nagy
 
Posts: n/a
Default Re: TB-sized databases

On Thu, 2007-11-29 at 10:45 -0500, Tom Lane wrote:
> Given that this list spends all day every day discussing cases where the
> planner is wrong, I'd have to think that that's a bet I wouldn't take.
>
> You could probably avoid this risk by setting the cutoff at something
> like 100 or 1000 times what you really want to tolerate, but how
> useful is it then?


It would still be useful in the sense that if the planner is taking
wrong estimates you must correct it somehow... raise statistics target,
rewrite query or other tweaking, you should do something. An error is
sometimes better than gradually decreasing performance because of too
low statistics target for example. So if the error is thrown because of
wrong estimate, it is still a valid error raising a signal that the DBA
has to do something about it.

It's still true that if the planner estimates too low, it will raise no
error and will take the resources. But that's just what we have now, so
it wouldn't be a regression of any kind...

Cheers,
Csaba.



---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #30 (permalink)  
Old 04-19-2008, 11:42 AM
Mark Kirkwood
 
Posts: n/a
Default Re: TB-sized databases

Simon Riggs wrote:
> On Wed, 2007-11-28 at 14:48 +0100, Csaba Nagy wrote:
>
>
>> In fact an even more useful option would be to ask the planner to throw
>> error if the expected cost exceeds a certain threshold...
>>

>
> Well, I've suggested it before:
>
> statement_cost_limit on pgsql-hackers, 1 March 2006
>
> Would people like me to re-write and resubmit this patch for 8.4?
>
> Tom's previous concerns were along the lines of "How would know what to
> set it to?", given that the planner costs are mostly arbitrary numbers.
>
> Any bright ideas, or is it we want it and we don't care about the
> possible difficulties?
>
>


Knowing how to set it is a problem - but a possibly bigger one is that
monster query crippling your DW system, so I'd say lets have it.

Cheers

Mark

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 06:11 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Search Engine Optimization by vBSEO 3.2.0
www.UnixAdminTalk.com