Unix Technical Forum

Re: Much Ado About COUNT(*)

This is a discussion on Re: Much Ado About COUNT(*) within the pgsql Hackers forums, part of the PostgreSQL category; --> > Date: Wed, 12 Jan 2005 18:45:09 -0800 > From: Jeff Davis <jdavis-pgsql@empires.org> > To: Alvaro Herrera <alvherre@dcc.uchile.cl> > ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > pgsql Hackers

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-11-2008, 03:22 AM
Mark Cave-Ayland
 
Posts: n/a
Default Re: Much Ado About COUNT(*)

> Date: Wed, 12 Jan 2005 18:45:09 -0800
> From: Jeff Davis <jdavis-pgsql@empires.org>
> To: Alvaro Herrera <alvherre@dcc.uchile.cl>
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: Much Ado About COUNT(*)
> Message-ID: <1105584309.2886.410.camel@jeff>


(cut)

> Thanks for the link. It looks like it breaks it up into chunks of about

2KB. I think the
> conversation was mostly assuming the tables were somewhat closer to the

size of an
> index. If you have more than 2KB per tuple, pretty much anything you do

with an index
> would be faster I would think.



Hi Jeff/Alvaro,

I'm considering an application at the moment whereby I would need to do lots
of COUNT(*) on lots of separate tables without a WHERE clause. Would
something like the following help speed up the COUNT(*) by reducing the
tuple size being used for the count?


CREATE SEQUENCE id_seq;

CREATE TABLE person_count (
id int8
);

CREATE TABLE person (
id int8 DEFAULT nextval('id_seq');
first_name text,
surname text,
age int,
address1 text,
address2 text,
address3 text,
address4 text,
postcode text
tel text
);

For each insert:

BEGIN;
INSERT INTO person (first_name, .... Tel) VALUES ('Fred', ....
'12345');
INSERT INTO person_count(id) VALUES (currval('id_seq'));
COMMIT;


So then I would use SELECT COUNT(*) FROM person_count whenever I wanted to
know the current number of person records. How much quicker would a COUNT(*)
be if visibility were included in the indices as opposed to a "hacked"
approach like this?


Many thanks,

Mark.

------------------------
WebBased Ltd
South West Technology Centre
Tamar Science Park
Plymouth
PL6 8BT

T: +44 (0)1752 791021
F: +44 (0)1752 791023
W: http://www.webbased.co.uk




---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-11-2008, 03:22 AM
Bruno Wolff III
 
Posts: n/a
Default Re: Much Ado About COUNT(*)

On Wed, Jan 19, 2005 at 14:59:17 -0000,
Mark Cave-Ayland <m.cave-ayland@webbased.co.uk> wrote:
> BEGIN;
> INSERT INTO person (first_name, .... Tel) VALUES ('Fred', ....
> '12345');
> INSERT INTO person_count(id) VALUES (currval('id_seq'));
> COMMIT;
>
>
> So then I would use SELECT COUNT(*) FROM person_count whenever I wanted to
> know the current number of person records. How much quicker would a COUNT(*)
> be if visibility were included in the indices as opposed to a "hacked"
> approach like this?


You are only going to get a constant factor speed up unless the space savings
allows much better use of cache. You probably want to look at using
triggers to maintain counts in another table.

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-11-2008, 03:22 AM
Alvaro Herrera
 
Posts: n/a
Default Re: Much Ado About COUNT(*)

On Wed, Jan 19, 2005 at 10:16:38AM -0600, Bruno Wolff III wrote:
> On Wed, Jan 19, 2005 at 14:59:17 -0000,
> Mark Cave-Ayland <m.cave-ayland@webbased.co.uk> wrote:
>
> > So then I would use SELECT COUNT(*) FROM person_count whenever I wanted to
> > know the current number of person records. How much quicker would a COUNT(*)
> > be if visibility were included in the indices as opposed to a "hacked"
> > approach like this?

>
> You are only going to get a constant factor speed up unless the space savings
> allows much better use of cache. You probably want to look at using
> triggers to maintain counts in another table.


I'd try using a "start value" and a differences list. So the
differences list would be initially empty and the start value would be
0. On insert or delete, you create a new difference (with +1 or
whatever). Periodically, the differences would be added to the start
value and the records deleted. Thus the time to calculate the total
count is much smaller, and it follows MVCC rules. Of course there are
lots of minor details not mentioned here.

Not sure if I'd model this with a single table or two.

--
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"I would rather have GNU than GNOT." (ccchips, lwn.net/Articles/37595/)

---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 04-11-2008, 03:23 AM
Jeff Davis
 
Posts: n/a
Default Re: Much Ado About COUNT(*)


To fill in some details I think what he's saying is this:

=> create table foo(...);
=> create table foo_count(num int);
=> insert into foo_count values(0);
=> create table foo_change(num int);

then create a trigger "after delete on foo" that does "insert into
foo_change values(-1)" and a trigger "after insert on foo" that inserts
a +1 into foo_change.

Periodically, do:
=> begin;
=> set transaction isolation level serializable;
=> update foo_count set num=num+(select sum(num) from foo_change);
=> delete from foo_change;
=> commit;
=> VACUUM;

And then any time you need the correct count(*) value, do instead:
=> select sum(num) from (select num from foo_count union select num from
foo_change);

And that should work. I haven't tested this exact example, so I may have
overlooked something.

Hope that helps. That way, you don't have huge waste from the second
table, and also triggers maintain it for you and you don't need to think
about it.

Regards,
Jeff Davis

On Wed, 2005-01-19 at 17:40 -0300, Alvaro Herrera wrote:
> On Wed, Jan 19, 2005 at 10:16:38AM -0600, Bruno Wolff III wrote:
> > On Wed, Jan 19, 2005 at 14:59:17 -0000,
> > Mark Cave-Ayland <m.cave-ayland@webbased.co.uk> wrote:
> >
> > > So then I would use SELECT COUNT(*) FROM person_count whenever I wanted to
> > > know the current number of person records. How much quicker would a COUNT(*)
> > > be if visibility were included in the indices as opposed to a "hacked"
> > > approach like this?

> >
> > You are only going to get a constant factor speed up unless the space savings
> > allows much better use of cache. You probably want to look at using
> > triggers to maintain counts in another table.

>
> I'd try using a "start value" and a differences list. So the
> differences list would be initially empty and the start value would be
> 0. On insert or delete, you create a new difference (with +1 or
> whatever). Periodically, the differences would be added to the start
> value and the records deleted. Thus the time to calculate the total
> count is much smaller, and it follows MVCC rules. Of course there are
> lots of minor details not mentioned here.
>
> Not sure if I'd model this with a single table or two.
>



---------------------------(end of broadcast)---------------------------
TIP 4: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 04-11-2008, 03:23 AM
D'Arcy J.M. Cain
 
Posts: n/a
Default Re: Much Ado About COUNT(*)

On Thu, 20 Jan 2005 10:12:17 -0000
"Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> wrote:
> Thanks for the information. I seem to remember something similar to
> this being discussed last year in a similar thread. My only real issue
> I can see with this approach is that the trigger is fired for every
> row, and it is likely that the database I am planning will have large
> inserts of several hundred thousand records. Normally the impact of
> these is minimised by inserting the entire set in one transaction. Is
> there any way that your trigger can be modified to fire once per
> transaction with the number of modified rows as a parameter?


I don't believe that such a facility exists but before dismissing it you
should test it out. I think that you will find that disk buffering (the
system's as well as PostgreSQL's) will effectively handle this for you
anyway.

--
D'Arcy J.M. Cain <darcy@druid.net> | Democracy is three wolves
http://www.druid.net/darcy/ | and a sheep voting on
+1 416 425 1212 (DoD#0082) (eNTP) | what's for dinner.

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faqs/FAQ.html

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #6 (permalink)  
Old 04-11-2008, 03:23 AM
Mark Cave-Ayland
 
Posts: n/a
Default Re: Much Ado About COUNT(*)


> -----Original Message-----
> From: Jeff Davis [mailto:jdavis-pgsql@empires.org]
> Sent: 19 January 2005 21:33
> To: Alvaro Herrera
> Cc: Mark Cave-Ayland; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Much Ado About COUNT(*)
>
>
>
> To fill in some details I think what he's saying is this:
>
> => create table foo(...);
> => create table foo_count(num int);
> => insert into foo_count values(0);
> => create table foo_change(num int);
>
> then create a trigger "after delete on foo" that does "insert
> into foo_change values(-1)" and a trigger "after insert on
> foo" that inserts a +1 into foo_change.
>
> Periodically, do:
> => begin;
> => set transaction isolation level serializable;
> => update foo_count set num=num+(select sum(num) from
> foo_change); => delete from foo_change; => commit; => VACUUM;
>
> And then any time you need the correct count(*) value, do
> instead: => select sum(num) from (select num from foo_count
> union select num from foo_change);
>
> And that should work. I haven't tested this exact example, so
> I may have overlooked something.
>
> Hope that helps. That way, you don't have huge waste from the
> second table, and also triggers maintain it for you and you
> don't need to think about it.
>
> Regards,
> Jeff Davis



Hi Jeff,

Thanks for the information. I seem to remember something similar to this
being discussed last year in a similar thread. My only real issue I can see
with this approach is that the trigger is fired for every row, and it is
likely that the database I am planning will have large inserts of several
hundred thousand records. Normally the impact of these is minimised by
inserting the entire set in one transaction. Is there any way that your
trigger can be modified to fire once per transaction with the number of
modified rows as a parameter?


Many thanks,

Mark.

------------------------
WebBased Ltd
South West Technology Centre
Tamar Science Park
Plymouth
PL6 8BT

T: +44 (0)1752 791021
F: +44 (0)1752 791023
W: http://www.webbased.co.uk




---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #7 (permalink)  
Old 04-11-2008, 03:23 AM
Richard Huxton
 
Posts: n/a
Default Re: Much Ado About COUNT(*)

D'Arcy J.M. Cain wrote:
> On Thu, 20 Jan 2005 10:12:17 -0000
> "Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> wrote:
>
>>Thanks for the information. I seem to remember something similar to
>>this being discussed last year in a similar thread. My only real issue
>>I can see with this approach is that the trigger is fired for every
>>row, and it is likely that the database I am planning will have large
>>inserts of several hundred thousand records. Normally the impact of
>>these is minimised by inserting the entire set in one transaction. Is
>>there any way that your trigger can be modified to fire once per
>>transaction with the number of modified rows as a parameter?

>
>
> I don't believe that such a facility exists but before dismissing it you
> should test it out. I think that you will find that disk buffering (the
> system's as well as PostgreSQL's) will effectively handle this for you
> anyway.


Well, it looks like ROW_COUNT isn't set in a statement-level trigger
function (GET DIAGNOSTICS myvar=ROW_COUNT). Which is a shame, otherwise
it would be easy to handle. It should be possible to expose this
information though, since it gets reported at the command conclusion.

--
Richard Huxton
Archonet Ltd

-- stmt_trig_test.sql --
BEGIN;

CREATE TABLE trigtest (
a int4 NOT NULL,
b text,
PRIMARY KEY (a)
);

CREATE FUNCTION tt_test_fn() RETURNS TRIGGER AS '
DECLARE
nr integer;
ro integer;
nr2 integer;
BEGIN
GET DIAGNOSTICS nr = ROW_COUNT;
GET DIAGNOSTICS ro = RESULT_OID;
SELECT count(*) INTO nr2 FROM trigtest;

RAISE NOTICE ''nr = % / ro = % / nr2 = %'',nr,ro,nr2;

RETURN NULL;
END;
' LANGUAGE plpgsql;

CREATE TRIGGER tt_test AFTER INSERT OR UPDATE ON trigtest
FOR EACH STATEMENT
EXECUTE PROCEDURE tt_test_fn();

INSERT INTO trigtest VALUES (1,'a');
INSERT INTO trigtest VALUES (2,'b');
UPDATE trigtest SET b = 'x';

ROLLBACK;

---------------------------(end of broadcast)---------------------------
TIP 7: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 10:53 PM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Search Engine Optimization by vBSEO 3.2.0
www.UnixAdminTalk.com