Unix Technical Forum

Experiences with extensibility

This is a discussion on Experiences with extensibility within the Pgsql General forums, part of the PostgreSQL category; --> On 08.01.2008, at 23:20, Joshua D. Drake wrote: > That isn't really an extensibility argument. I was thinking about ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql General

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #21 (permalink)  
Old 04-10-2008, 12:17 AM
Guido Neitzer
 
Posts: n/a
Default Re: Experiences with extensibility

On 08.01.2008, at 23:20, Joshua D. Drake wrote:

> That isn't really an extensibility argument.


I was thinking about that too - for me, it still is just an
outstanding issue with PostgreSQL. It is incredibly scalable on one
machine but it totally sucks when you want more, but not much more.

Like, I have a situation where I need multi-master just for
availability. Two small servers are good enough for that. But
unfortunately with PostgreSQL the whole setup is a major pain in the ...

> At least not in my mind. Further I don't know of anyone that can
> "easily" do it. You either suffer the possibility of catastrophic
> data loss (dolphins) or you suffer guaranteed bank account drainage
> (Oracle), or you suffer the willingness of Monopolies (MSSQL).


FrontBase. It has an incredibly easy to configure replication and
multi master clustering support, is very reliable and can also handle
really big databases. The only problem is that the query planner is
not as good as PostgreSQL's so you might end up with much worse
performance. Depends a bit on the complexity of the database and how
"special" your queries are.

But if you need something easy to setup, multi-master with just two
machines, easy fail-over (done in the JDBC driver) without your
application even noticing it - try it. It's free, but not open source.
And it's a good product. I use it for some stuff and PostgreSQL for
other projects. Just depends on the requirements.

cug

--
http://www.event-s.net


---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #22 (permalink)  
Old 04-10-2008, 12:17 AM
Joshua D. Drake
 
Posts: n/a
Default Re: Experiences with extensibility

Sim Zacks wrote:

> We use postgresql because it is open source, we have in-house experience
> to deal with it so we don't have any extra support costs and we don't
> need the features that are offered in commercial products that
> PostGreSQL does not have. We also don't need the speed that commercial
> products offer that is missing in PostgreSQL.


I use PostgreSQL because it has a ton of features the closed source
products don't offer and is generally faster than the closed source
solutions.

Granted there are scenarios where others are FASTER (SELECT COUNT(*))
but I find that if you are doing those items, you normally have a weird
design anyway.

Sincerely,

Joshua D. Drake


---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #23 (permalink)  
Old 04-10-2008, 12:17 AM
Ivan Sergio Borgonovo
 
Posts: n/a
Default count(*) and bad design was: Experiences with extensibility

On Wed, 09 Jan 2008 00:06:45 -0800
"Joshua D. Drake" <jd@commandprompt.com> wrote:

> Granted there are scenarios where others are FASTER (SELECT
> COUNT(*)) but I find that if you are doing those items, you
> normally have a weird design anyway.


> Sincerely,


Sincerely, would you make an example of such a bad design?

Or did you just mean that count(*) is bad design in postgresql since
there are usually better alternatives in postgresql?

I'm not joking. I'd like to learn.

--
Ivan Sergio Borgonovo
http://www.webthatworks.it


---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #24 (permalink)  
Old 04-10-2008, 12:17 AM
Dann Corbit
 
Posts: n/a
Default Re: count(*) and bad design was: Experiences with extensibility

> -----Original Message-----
> From: pgsql-general-owner@postgresql.org [mailtogsql-general-
> owner@postgresql.org] On Behalf Of Ivan Sergio Borgonovo
> Sent: Wednesday, January 09, 2008 1:30 AM
> To: pgsql-general@postgresql.org
> Subject: [GENERAL] count(*) and bad design was: Experiences with
> extensibility
>
> On Wed, 09 Jan 2008 00:06:45 -0800
> "Joshua D. Drake" <jd@commandprompt.com> wrote:
>
> > Granted there are scenarios where others are FASTER (SELECT
> > COUNT(*)) but I find that if you are doing those items, you
> > normally have a weird design anyway.

>
> > Sincerely,

>
> Sincerely, would you make an example of such a bad design?


A program that estimates cardinality by doing SELECT COUNT(*) is a bad
design. Assuming you have the wherewithal to vacuum your tables (or
have autovacuum enabled) a query against the system tables will be a
much better estimate of cardinality.

Now (some may argue) what if we want an _EXACT_ value for COUNT(*)? We
had better ask ourselves (in that circumstance) "Am I willing to lock
the entire table and scan it?" because that is what will be necessary to
get a truly exact value. Otherwise, you can get totals that are wildly
off-base if someone is doing a bulk import or deleting a large number of
records.

So:
SELECT reltuples FROM pg_class WHERE relname = <table_name>;

Is more often what is really wanted.

> Or did you just mean that count(*) is bad design in postgresql since
> there are usually better alternatives in postgresql?


If you are using COUNT(*) as an existence test, then substitute:

WHERE EXISTS(<criteria>)


Use the indexes (if possible) by WHERE clause restriction:

SELECT count(1) FROM <table_name> WHERE <condition_list>

Will use indexes if appropriate.


> I'm not joking. I'd like to learn.


I think this should be a FAQ because it is a (F)requently (A)sked
(Q)uestion.

IMO-YMMV.



> --
> Ivan Sergio Borgonovo
> http://www.webthatworks.it
>
>
> ---------------------------(end of

broadcast)---------------------------
> TIP 6: explain analyze is your friend


---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #25 (permalink)  
Old 04-10-2008, 12:17 AM
Ivan Sergio Borgonovo
 
Posts: n/a
Default Re: count(*) and bad design was: Experiences withextensibility

On Wed, 9 Jan 2008 01:39:34 -0800
"Dann Corbit" <DCorbit@connx.com> wrote:

> > On Wed, 09 Jan 2008 00:06:45 -0800
> > "Joshua D. Drake" <jd@commandprompt.com> wrote:


> > > Granted there are scenarios where others are FASTER (SELECT
> > > COUNT(*)) but I find that if you are doing those items, you
> > > normally have a weird design anyway.


> > > Sincerely,


> > Sincerely, would you make an example of such a bad design?


> A program that estimates cardinality by doing SELECT COUNT(*) is a
> bad design. Assuming you have the wherewithal to vacuum your
> tables (or have autovacuum enabled) a query against the system
> tables will be a much better estimate of cardinality.
>
> Now (some may argue) what if we want an _EXACT_ value for
> COUNT(*)? We had better ask ourselves (in that circumstance) "Am I
> willing to lock the entire table and scan it?" because that is what
> will be necessary to get a truly exact value. Otherwise, you can
> get totals that are wildly off-base if someone is doing a bulk
> import or deleting a large number of records.


Please forgive my naiveness in this field but what does it mean an
"exact count" and what other DB means with "an exact count" and how
other DB deal with it?

How "count" is defined in the SQL standard?

Is there a real situation then where you really need the "exact"
count?

Am I right saying that:

select count(*) from ...
-- here count may already be different

and that:
select for update count(*)

could be very expensive? Or what would it mean to do a
select for update count(*) ...?


> I think this should be a FAQ because it is a (F)requently (A)sked
> (Q)uestion.


After reading your email I think the real problem is not how to do
otherwise but understand what count(*) really mean and when and if it
is really useful and when it can be avoided.

I'd write in the FAQ something in the line of:

- What count(*) really does?
- When it can be avoided?
- When it can't be avoided?

In my everyday use of count(*), after your email I can hardly spot a
place where I need an exact count.
But to better understand and being convinced that using count(*) is
bad design I think last question could help a lot.

How does count(*) with a where clause perform generally on postgresql
compared to other DB?
I'd expect it perform as good or better than other DB since now the
bottleneck should be how efficiently it can filter records... but
still a count(*) with a where clause will incur in the same problem
of what "exact" means.

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it


---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #26 (permalink)  
Old 04-10-2008, 12:17 AM
Harald Armin Massa
 
Posts: n/a
Default Re: count(*) and bad design was: Experiences with extensibility

Ivan,

> Please forgive my naiveness in this field but what does it mean an
> "exact count" and what other DB means with "an exact count" and how
> other DB deal with it?


PostgreSQL will give you an exact count of the contents of the
database as it is in the moment you begin your count. (i.e. the
transaction starts)

BUT as the table is not locked, in parallel somebody can bulkload MANY
items into the database, so at the moment (start of your transaction)
+ 1msec your count may be invalid allready.

> I'd expect it perform as good or better than other DB since now the
> bottleneck should be how efficiently it can filter records... but
> still a count(*) with a where clause will incur in the same problem
> of what "exact" means.


I know of 3 concepts to answer count() faster then PostreSQL:

1) just lie, present an estimate
2) do not have MVCC
3) store "record deleted info" in index, so you can answer count()
with only scanning the index

Harald


--
GHUM Harald Massa
persuadere et programmare
Harald Armin Massa
Spielberger Straße 49
70435 Stuttgart
0173/9409607
fx 01212-5-13695179
-
EuroPython 2008 will take place in Vilnius, Lithuania - Stay tuned!

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #27 (permalink)  
Old 04-10-2008, 12:17 AM
Sim Zacks
 
Posts: n/a
Default Re: Experiences with extensibility

I believe I was misunderstood. The fact that a product is closed source does not
make it a better product. Some companies that are using Oracle would be better
off using PostgreSQL. Other companies that need the features that Oracle offers
would not be better off using Postgresql.

However, there are a lot of closed source products that are better then their
open source counter-parts.
Oracle vs. Postgresql is one of them. Obviously if you don't need the
feature-set provided by Oracle then you would be foolish for paying for it. But
if you do need the extra features, then it is worth it.

Sim




Clodoaldo wrote:
> 2008/1/9, Sim Zacks <sim@compulab.co.il>:
>> The reason companies go with the closed source, expensive solutions is because
>> they are better products.

>
> Not necessarily. FOSS products don't have a selling team to persuade
> and bribe people. Expensive solutions, and that is in part what make
> them expensive, can spend lots of time persuading and can offer good
> money to those who decide which is the "best" product. Those who
> decide are not the coders or db admins and in general don't really
> care much.
>
> Regards, Clodoaldo Pinto Neto
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend
>

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #28 (permalink)  
Old 04-10-2008, 12:17 AM
Andrew Sullivan
 
Posts: n/a
Default Re: Experiences with extensibility

On Tue, Jan 08, 2008 at 11:37:38PM -0700, Guido Neitzer wrote:
> Like, I have a situation where I need multi-master just for
> availability. Two small servers are good enough for that. But
> unfortunately with PostgreSQL the whole setup is a major pain in the ...


Really? I don't think a RAID array with heartbeat and failover is that big
a pain. It requires some careful implementation, but it can be made to work
well, I think.

A


---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #29 (permalink)  
Old 04-10-2008, 12:17 AM
Ivan Sergio Borgonovo
 
Posts: n/a
Default Re: count(*) and bad design was: Experiences withextensibility

On Wed, 09 Jan 2008 16:33:54 +0200
Sim Zacks <sim@compulab.co.il> wrote:

> Using count(*) is not bad design, though generally it makes sense
> to use it with a where.


I got the impression from others comments that postgresql
under perform other DB even when a where clause on indexed column is
involved.

I may have misinterpreted this but still could someone clarify?

Harald Armin Massa wrote:

> 1) just lie, present an estimate
> 2) do not have MVCC
> 3) store "record deleted info" in index, so you can answer count()
> with only scanning the index


I got the impression that even counting with clauses on on indexed
columns means you'll have to check if columns are still there. That
seems to imply that the extra cost make pg under perform compared to
other DB even in that scenario.

I beg pardon to Harald if I misinterpreted his words.

> Saying using count(*) is bad design means that the only design that
> you can visualize is the specific one that you are using.


I'd be interested in some example by Joshua otherwise I can't
understand what he meant.

If you're interested in all the record in a table, there is no way to
have an "engraved in stone" answer and since there are no "where
clauses" you can cache that info and update it once in a while.
If you have a where clause I'm not expecting an "engraved in stone"
answer but I'd expect to have a quick way to get an estimate and I
still haven't understood if:
a) I could improve my design to avoid count
b) postgres perform as good as other db on count where there is a
where clause
c) is there a way to have a quick estimate avoiding count when there
is a where clause

> There are tons of real world examples where you need count. That is
> why so many people use it as a benchmark.


> Obviously if you have an application where millions of rows are
> added and subtracted every minute, then the value of count is kind
> of vague.


> However, if you are querying a data warehouse that gets populated
> once a month, then count has a huge value. You could use statistics
> in that case, but it is much harder to join against other tables
> when you are using statistics. It is also less intuitive then using
> the SQL standard for finding the number of rows.


Again: paging records. You can't do statistics.
Surely you could optimise and lie... but that comes to a cost
compared to the simplicity of count.

Still everybody knows that a frequent complain about postgresql is it
has a slow count.

I can understand grey answer provided they are coherent.
a) the above claim is false
b) this claim is true just on cases where you could opt for a better
design
c) this claim is false for count without where clause
d) this claim is true
e) ...

details on b) would be much appreciated. Other cases require just a
yes/no answer.

thanks

--
Ivan Sergio Borgonovo
http://www.webthatworks.it


---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org/

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #30 (permalink)  
Old 04-10-2008, 12:17 AM
Scott Marlowe
 
Posts: n/a
Default Re: count(*) and bad design was: Experiences with extensibility

On Jan 9, 2008 8:33 AM, Sim Zacks <sim@compulab.co.il> wrote:
> Using count(*) is not bad design, though generally it makes sense to use it with
> a where.
>
> Saying using count(*) is bad design means that the only design that you can
> visualize is the specific one that you are using.
>
> There are tons of real world examples where you need count. That is why so many
> people use it as a benchmark.
>
> Obviously if you have an application where millions of rows are added and
> subtracted every minute, then the value of count is kind of vague.
>
> However, if you are querying a data warehouse that gets populated once a month,
> then count has a huge value. You could use statistics in that case, but it is
> much harder to join against other tables when you are using statistics. It is
> also less intuitive then using the SQL standard for finding the number of rows.


I think part of the problem is people think of count() as something
other than an aggregate function.

If I ran "select avg(i) from table" on a 20M row table, I'd expect it
to take a few seconds, after all, I'm running a function across 20
Million rows.

Some databases have the ability to short-circuit count(*) without a
where clause, some with a where clause. But the basic model of
count(*) is that it's an aggregate function, and what you're asking
the db to do is to count every single row in the db that matches your
where clause.

Now, everything's a tradeoff. If PostgreSQL had visibility
information in the indexes, it would have to lock both the table and
index for every write, thus slowing down all the other queries that
are trying to access the table. It would be a tradeoff that
sacrificed write speed for read speed. In a db that was used mostly
for writing, it would likely be a fair trade. In a db that did a lot
of writing, it might slow the whole thing to a crawl.

So, the slow count(*) performance of postgresql, especially count(*)
without a where clause, is a result of the TANSTAAFL principle (there
ain't no such thing as a free lunch).

Now, if there's a where clause that's selective enough, then a
count(*) query may use the index and be a bit faster, but remember, in
pgsql, it's still got to actually hit the table to see if each tuple
really is visible to this transaction, so the index needs to be fairly
selective to be a win.

A possible workaround is to have something like a separate table with
nothing but the IDs and whatever would be in your where clause for the
tables you're accessing with a foreign key to it, and use THAT for a
count(*). Since the rows are skinnier, the count(*) will be faster.
Another alternative is to have a trigger fire that keeps a track of
the size of the table in a summary table when rows are added and
deleted.

Each of these methods "costs" you something, in time and effort or
performance, and that's why they're not automatic. For instance, I
have an 80M row stats db that grows by about 1M rows a week. I do NOT
need to count the whole thing, hardly ever, and can live with the fact
that I don't know exactly how many rows it has at any given time.
When I do a select count(*) with a where clause it's usually
restricted to < 1 weeks data and can use an index and come back pretty
quickly.

select count(*) from bigtable where inserttime > now() - interval '1 week';
count
---------
1254269
(1 row)

Time: 21422.368 ms

Second run:
select count(*) from bigtable where inserttime > now() - interval '1 week';
count
---------
1254320
(1 row)

Time: 2001.204 ms

With the data loaded into shared_buffers / linux kernel cache, that's
not too bad.

Now, I run the same query against our production oracle machine, which
is a MUCH more powerful server...

SQL> select count(*) from bigtable where inserttime > SYSDATE-7;

COUNT(*)
----------
1255972

Elapsed: 00:00:18.62

second run:
SQL> select count(*) from bigtable where inserttime > SYSDATE-7;

COUNT(*)
----------
1255973
Elapsed: 00:00:00.98

Now, obviously, Oracle's got some optimizations for what it's got in
the buffer there, but the first run isn't really any faster. In fact,
for a much more powerful machine, the performance was, relatively
speaking, pretty bad compared to my little 1 CPU 1 sw RAID-10
reporting server.

So, while PostgreSQL's count(*) performance isn't blindingly fast,
it's not the dog some people make it out to be either.

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 11:09 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Search Engine Optimization by vBSEO 3.2.0
www.UnixAdminTalk.com