This is a discussion on Re: Much Ado About COUNT(*) within the pgsql Hackers forums, part of the PostgreSQL category; --> > Date: Wed, 12 Jan 2005 18:45:09 -0800 > From: Jeff Davis <jdavis-pgsql@empires.org> > To: Alvaro Herrera <alvherre@dcc.uchile.cl> > ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| > Date: Wed, 12 Jan 2005 18:45:09 -0800 > From: Jeff Davis <jdavis-pgsql@empires.org> > To: Alvaro Herrera <alvherre@dcc.uchile.cl> > Cc: pgsql-hackers@postgresql.org > Subject: Re: Much Ado About COUNT(*) > Message-ID: <1105584309.2886.410.camel@jeff> (cut) > Thanks for the link. It looks like it breaks it up into chunks of about 2KB. I think the > conversation was mostly assuming the tables were somewhat closer to the size of an > index. If you have more than 2KB per tuple, pretty much anything you do with an index > would be faster I would think. Hi Jeff/Alvaro, I'm considering an application at the moment whereby I would need to do lots of COUNT(*) on lots of separate tables without a WHERE clause. Would something like the following help speed up the COUNT(*) by reducing the tuple size being used for the count? CREATE SEQUENCE id_seq; CREATE TABLE person_count ( id int8 ); CREATE TABLE person ( id int8 DEFAULT nextval('id_seq'); first_name text, surname text, age int, address1 text, address2 text, address3 text, address4 text, postcode text tel text ); For each insert: BEGIN; INSERT INTO person (first_name, .... Tel) VALUES ('Fred', .... '12345'); INSERT INTO person_count(id) VALUES (currval('id_seq')); COMMIT; So then I would use SELECT COUNT(*) FROM person_count whenever I wanted to know the current number of person records. How much quicker would a COUNT(*) be if visibility were included in the indices as opposed to a "hacked" approach like this? Many thanks, Mark. ------------------------ WebBased Ltd South West Technology Centre Tamar Science Park Plymouth PL6 8BT T: +44 (0)1752 791021 F: +44 (0)1752 791023 W: http://www.webbased.co.uk ---------------------------(end of broadcast)--------------------------- TIP 8: explain analyze is your friend |
| |||
| On Wed, Jan 19, 2005 at 14:59:17 -0000, Mark Cave-Ayland <m.cave-ayland@webbased.co.uk> wrote: > BEGIN; > INSERT INTO person (first_name, .... Tel) VALUES ('Fred', .... > '12345'); > INSERT INTO person_count(id) VALUES (currval('id_seq')); > COMMIT; > > > So then I would use SELECT COUNT(*) FROM person_count whenever I wanted to > know the current number of person records. How much quicker would a COUNT(*) > be if visibility were included in the indices as opposed to a "hacked" > approach like this? You are only going to get a constant factor speed up unless the space savings allows much better use of cache. You probably want to look at using triggers to maintain counts in another table. ---------------------------(end of broadcast)--------------------------- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| On Wed, Jan 19, 2005 at 10:16:38AM -0600, Bruno Wolff III wrote: > On Wed, Jan 19, 2005 at 14:59:17 -0000, > Mark Cave-Ayland <m.cave-ayland@webbased.co.uk> wrote: > > > So then I would use SELECT COUNT(*) FROM person_count whenever I wanted to > > know the current number of person records. How much quicker would a COUNT(*) > > be if visibility were included in the indices as opposed to a "hacked" > > approach like this? > > You are only going to get a constant factor speed up unless the space savings > allows much better use of cache. You probably want to look at using > triggers to maintain counts in another table. I'd try using a "start value" and a differences list. So the differences list would be initially empty and the start value would be 0. On insert or delete, you create a new difference (with +1 or whatever). Periodically, the differences would be added to the start value and the records deleted. Thus the time to calculate the total count is much smaller, and it follows MVCC rules. Of course there are lots of minor details not mentioned here. Not sure if I'd model this with a single table or two. -- Alvaro Herrera (<alvherre[@]dcc.uchile.cl>) "I would rather have GNU than GNOT." (ccchips, lwn.net/Articles/37595/) ---------------------------(end of broadcast)--------------------------- TIP 8: explain analyze is your friend |
| |||
| To fill in some details I think what he's saying is this: => create table foo(...); => create table foo_count(num int); => insert into foo_count values(0); => create table foo_change(num int); then create a trigger "after delete on foo" that does "insert into foo_change values(-1)" and a trigger "after insert on foo" that inserts a +1 into foo_change. Periodically, do: => begin; => set transaction isolation level serializable; => update foo_count set num=num+(select sum(num) from foo_change); => delete from foo_change; => commit; => VACUUM; And then any time you need the correct count(*) value, do instead: => select sum(num) from (select num from foo_count union select num from foo_change); And that should work. I haven't tested this exact example, so I may have overlooked something. Hope that helps. That way, you don't have huge waste from the second table, and also triggers maintain it for you and you don't need to think about it. Regards, Jeff Davis On Wed, 2005-01-19 at 17:40 -0300, Alvaro Herrera wrote: > On Wed, Jan 19, 2005 at 10:16:38AM -0600, Bruno Wolff III wrote: > > On Wed, Jan 19, 2005 at 14:59:17 -0000, > > Mark Cave-Ayland <m.cave-ayland@webbased.co.uk> wrote: > > > > > So then I would use SELECT COUNT(*) FROM person_count whenever I wanted to > > > know the current number of person records. How much quicker would a COUNT(*) > > > be if visibility were included in the indices as opposed to a "hacked" > > > approach like this? > > > > You are only going to get a constant factor speed up unless the space savings > > allows much better use of cache. You probably want to look at using > > triggers to maintain counts in another table. > > I'd try using a "start value" and a differences list. So the > differences list would be initially empty and the start value would be > 0. On insert or delete, you create a new difference (with +1 or > whatever). Periodically, the differences would be added to the start > value and the records deleted. Thus the time to calculate the total > count is much smaller, and it follows MVCC rules. Of course there are > lots of minor details not mentioned here. > > Not sure if I'd model this with a single table or two. > ---------------------------(end of broadcast)--------------------------- TIP 4: Don't 'kill -9' the postmaster |
| |||
| On Thu, 20 Jan 2005 10:12:17 -0000 "Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> wrote: > Thanks for the information. I seem to remember something similar to > this being discussed last year in a similar thread. My only real issue > I can see with this approach is that the trigger is fired for every > row, and it is likely that the database I am planning will have large > inserts of several hundred thousand records. Normally the impact of > these is minimised by inserting the entire set in one transaction. Is > there any way that your trigger can be modified to fire once per > transaction with the number of modified rows as a parameter? I don't believe that such a facility exists but before dismissing it you should test it out. I think that you will find that disk buffering (the system's as well as PostgreSQL's) will effectively handle this for you anyway. -- D'Arcy J.M. Cain <darcy@druid.net> | Democracy is three wolves http://www.druid.net/darcy/ | and a sheep voting on +1 416 425 1212 (DoD#0082) (eNTP) | what's for dinner. ---------------------------(end of broadcast)--------------------------- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html |
| |||
| > -----Original Message----- > From: Jeff Davis [mailto:jdavis-pgsql@empires.org] > Sent: 19 January 2005 21:33 > To: Alvaro Herrera > Cc: Mark Cave-Ayland; pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] Much Ado About COUNT(*) > > > > To fill in some details I think what he's saying is this: > > => create table foo(...); > => create table foo_count(num int); > => insert into foo_count values(0); > => create table foo_change(num int); > > then create a trigger "after delete on foo" that does "insert > into foo_change values(-1)" and a trigger "after insert on > foo" that inserts a +1 into foo_change. > > Periodically, do: > => begin; > => set transaction isolation level serializable; > => update foo_count set num=num+(select sum(num) from > foo_change); => delete from foo_change; => commit; => VACUUM; > > And then any time you need the correct count(*) value, do > instead: => select sum(num) from (select num from foo_count > union select num from foo_change); > > And that should work. I haven't tested this exact example, so > I may have overlooked something. > > Hope that helps. That way, you don't have huge waste from the > second table, and also triggers maintain it for you and you > don't need to think about it. > > Regards, > Jeff Davis Hi Jeff, Thanks for the information. I seem to remember something similar to this being discussed last year in a similar thread. My only real issue I can see with this approach is that the trigger is fired for every row, and it is likely that the database I am planning will have large inserts of several hundred thousand records. Normally the impact of these is minimised by inserting the entire set in one transaction. Is there any way that your trigger can be modified to fire once per transaction with the number of modified rows as a parameter? Many thanks, Mark. ------------------------ WebBased Ltd South West Technology Centre Tamar Science Park Plymouth PL6 8BT T: +44 (0)1752 791021 F: +44 (0)1752 791023 W: http://www.webbased.co.uk ---------------------------(end of broadcast)--------------------------- TIP 7: don't forget to increase your free space map settings |
| ||||
| D'Arcy J.M. Cain wrote: > On Thu, 20 Jan 2005 10:12:17 -0000 > "Mark Cave-Ayland" <m.cave-ayland@webbased.co.uk> wrote: > >>Thanks for the information. I seem to remember something similar to >>this being discussed last year in a similar thread. My only real issue >>I can see with this approach is that the trigger is fired for every >>row, and it is likely that the database I am planning will have large >>inserts of several hundred thousand records. Normally the impact of >>these is minimised by inserting the entire set in one transaction. Is >>there any way that your trigger can be modified to fire once per >>transaction with the number of modified rows as a parameter? > > > I don't believe that such a facility exists but before dismissing it you > should test it out. I think that you will find that disk buffering (the > system's as well as PostgreSQL's) will effectively handle this for you > anyway. Well, it looks like ROW_COUNT isn't set in a statement-level trigger function (GET DIAGNOSTICS myvar=ROW_COUNT). Which is a shame, otherwise it would be easy to handle. It should be possible to expose this information though, since it gets reported at the command conclusion. -- Richard Huxton Archonet Ltd -- stmt_trig_test.sql -- BEGIN; CREATE TABLE trigtest ( a int4 NOT NULL, b text, PRIMARY KEY (a) ); CREATE FUNCTION tt_test_fn() RETURNS TRIGGER AS ' DECLARE nr integer; ro integer; nr2 integer; BEGIN GET DIAGNOSTICS nr = ROW_COUNT; GET DIAGNOSTICS ro = RESULT_OID; SELECT count(*) INTO nr2 FROM trigtest; RAISE NOTICE ''nr = % / ro = % / nr2 = %'',nr,ro,nr2; RETURN NULL; END; ' LANGUAGE plpgsql; CREATE TRIGGER tt_test AFTER INSERT OR UPDATE ON trigtest FOR EACH STATEMENT EXECUTE PROCEDURE tt_test_fn(); INSERT INTO trigtest VALUES (1,'a'); INSERT INTO trigtest VALUES (2,'b'); UPDATE trigtest SET b = 'x'; ROLLBACK; ---------------------------(end of broadcast)--------------------------- TIP 7: don't forget to increase your free space map settings |