This is a discussion on Experiences with extensibility within the Pgsql General forums, part of the PostgreSQL category; --> On Wed, Jan 09, 2008 at 12:38:43PM -0700, Guido Neitzer wrote: > >>Easy multi-master clustering with just two machines. ...
| |||||||
| FAQ | Members List | Calendar | Search | Today's Posts | Mark Forums Read |
| ||||
| On Wed, Jan 09, 2008 at 12:38:43PM -0700, Guido Neitzer wrote: > >>Easy multi-master clustering with just two machines. > As I said: FrontBase is offering that. It looks like a two-phase commit answer, if I'm reading correctly. You can do this today on many systems (including Postgres), but the classical problem with 2PC clustering is that it is very slow compared to single-node systems. Anyway, FrontBase looks interesting. Thanks for the reference. A ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly |
| |||
| It would be an administrative nightmare unless you had very few where clauses that you were tracking. Instead of using a trigger, you could use Listen/Notify to call a daemon on the server to run the procedure and then you have no insert/delete overhead. Or you could call the function on a cron job every 10 minutes... Chris Browne wrote: > zb@cybertec.at (Zoltan Boszormenyi) writes: >> which will be fast and depending on the initial value of COUNT(*) >> it will be very close to the exact figure. You can extend the example >> with more columns if you know your SELECT COUNT(*) ... WHERE >> conditions in advance but this way you have to keep several administrative >> tables for different monitored tables. Again, this trades some disk space >> and INSERT/DELETE operation speed on the monitored tables for >> quicker count. > > Actually, this approach will be Really Terrible for any cases where > multiple connections are adding/deleting tuples concurrently, as it > will force ALL updates to serialize behind the update to the central > table. > > Occasionally, you'll have something even worse, namely a deadlock, > where two or more of the updates fighting over the single summary > tuple fall into a bad state, and one of them is forced to give up, > potentially rolling back its whole transaction. > > [Waving hands for a moment] > > What I would do *instead* would be for each INSERT to add a tuple with > a count of 1, and for each DELETE to add a tuple with count of -1, and > then to periodically have a single process walk through to summarize > the table. There may be a further optimization to be had by doing a > per-statement trigger that counts the number of INSERTs/DELETEs done, > so that inserting 30 tuples (in the table being tracked) leads to > adding a single tuple with count of 30 in the summary table. > > That changes the tradeoffs, again... > > - Since each INSERT/DELETE is simply doing an INSERT into the summary > table, the ongoing activity is *never* blocking anything > > - You get the count by requesting > SELECT SUM(rec_cnt) as record_count from rec_count where tablename = 'foo'; > > - Note that the query is MVCC-consistent with the table! > > - Once in a while, you'll want to run a single procedure that, for > each table, deletes all the existing records, and replaces them > with a single one consisting of the sum of the individual values. > > - You can re-sync a table by running the query: > begin; > delete from record_count where tablename = 'foo'; > insert into record_count(tablename, rec_cnt) select 'foo', (select count(*) from foo); > commit; |
| |||
| On Wed, Jan 09, 2008 at 03:28:04PM +0100, Ivan Sergio Borgonovo wrote: > Let me consider an everyday use where count() looks as the most > obvious solution: paging. > > I search trough a table and I need to know which is the last page. There's an often overlooked solution to this. Let's say your count returns 100,000 records, are you going to give them link to 1000 different pages? Not really. Probably about 10, so really your count only is interested in an exact result less than 100, or that's it's more than 100. By placing the where clause in a subselect with a limit of 101 and a count() around it you have an upper bound on the cost of the count, the result 101 simply means "more than 100". Depending on the data you might just put the limit on the query that fetches the data and using everything after the 10th record to determine your count on the client side and skip the extra round trip. > Is there a way to count based on indexes without taking into account > deleted rows so to "count" faster? I've also often pondered whether the SQL standard support for table sampling would be good here. Sure, you still need to check visibility, but if you specify that the DB only needs to check 10% of the tuples and to extrapolate the results from that, you could get a fast yet reasonably accurate result. IIRC patches for this have been floated on the lists. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Those who make peaceful revolution impossible will make violent revolution inevitable. > -- John F Kennedy -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.1 (GNU/Linux) iD8DBQFHhfo2IB7bNG8LQkwRAviuAJ9nbGeFCOdr4m8QGq1AjD 70Y3HE7QCdERPN g/ssQLq3Dsyqz1Z6vjqaumQ= =JOy1 -----END PGP SIGNATURE----- |
| |||
| In article <60ejcqy6j0.fsf@dba2.int.libertyrms.com>, Chris Browne <cbbrowne@acm.org> writes: > There may be a further optimization to be had by doing a > per-statement trigger that counts the number of INSERTs/DELETEs done, > so that inserting 30 tuples (in the table being tracked) leads to > adding a single tuple with count of 30 in the summary table. This would be nice, but at least the 8.2.4 docs say Statement-level triggers do not currently have any way to examine the individual row(s) modified by the statement. Is this restriction removed in a later version? ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |
| |||
| Harald Fuchs wrote: > In article <60ejcqy6j0.fsf@dba2.int.libertyrms.com>, > Chris Browne <cbbrowne@acm.org> writes: > > > There may be a further optimization to be had by doing a > > per-statement trigger that counts the number of INSERTs/DELETEs done, > > so that inserting 30 tuples (in the table being tracked) leads to > > adding a single tuple with count of 30 in the summary table. > > This would be nice, but at least the 8.2.4 docs say > > Statement-level triggers do not currently have any way to examine > the individual row(s) modified by the statement. > > Is this restriction removed in a later version? Nope. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc. ---------------------------(end of broadcast)--------------------------- TIP 4: Have you searched our list archives? http://archives.postgresql.org/ |
| |||
| On 09.01.2008, at 13:51, Martin wrote: > I've been working with FrontBase a lot lately and I wouldn't say > anything about it qualifies as "incredibly easy" and reliable it > is not. We had never ever any reliability issues with FrontBase as long as didn't try to insert garbage. It really doesn't like that. > Performance of FrontBase is just plain terrible. One of > our reports takes 9 minutes on FrontBase and 10 seconds on > Postgres. As I said: depends on what you are doing. Which version did you use where you got that terrible performance? The latest one? There was a bug in there query planner in an old version that totally killed some queries. There is also the or-query problem (not sure whether that one is still in there): if your report does something like "select bar from foo where a = 1 or b = 1;" it didn't use indexes on a or b which is terrible. If you have that, use separate selects combined with a union if you can. In my cases it was never as fast as PostgreSQL as soon as there are more than two tables involved. But except a couple of cases where I had to use PostgreSQL for performance reasons, I was okay with it. But performance is not everything. PostgreSQL is for most cases the better product, but for some cases, FrontBase is. Depends on what you're doing how much which side has ... > Then there's the documentation issue... PostgreSQL documentation is better, but what is your issue exactly? The main problem I have with the FrontBase documentation is that it mostly gives no useful examples. As this is off-topic here: we can transfer that either to the FrontBase list or to private mail if you like. cug -- http://www.event-s.net ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match |
| |||
| Chris Browne wrote: > zb@cybertec.at (Zoltan Boszormenyi) writes: >> SELECT COUNT(*) > [Waving hands for a moment] Would what Chris describes below be a good candidate for a pgfoundry project that has functions that'll create the triggers for you? (yeah, I might be volunteering, but would undoubtedly need help) Then when people ask it again and again the response could be "install http://pgfoundry.org/fast_count_star" rather than "go read the mailing list archives and roll your own - but remember to worry about deadlock and contention on the table containing the counts". > What I would do *instead* would be for each INSERT to add a tuple with > a count of 1, and for each DELETE to add a tuple with count of -1, and > then to periodically have a single process walk through to summarize > the table. There may be a further optimization to be had by doing a > per-statement trigger that counts the number of INSERTs/DELETEs done, > so that inserting 30 tuples (in the table being tracked) leads to > adding a single tuple with count of 30 in the summary table. > > That changes the tradeoffs, again... > > - Since each INSERT/DELETE is simply doing an INSERT into the summary > table, the ongoing activity is *never* blocking anything > > - You get the count by requesting > SELECT SUM(rec_cnt) as record_count from rec_count where tablename = 'foo'; > > - Note that the query is MVCC-consistent with the table! > > - Once in a while, you'll want to run a single procedure that, for > each table, deletes all the existing records, and replaces them > with a single one consisting of the sum of the individual values. > > - You can re-sync a table by running the query: > begin; > delete from record_count where tablename = 'foo'; > insert into record_count(tablename, rec_cnt) select 'foo', (select count(*) from foo); > commit; |
| |||
| On Jan 9, 2008, at 8:07 PM, Scott Marlowe wrote: > I could see a use for an approximate count(*) with where clause, just > like I could see a use for the ability to retrieve random rows from a > table without using order by random() on it. And those are both > things that would require some form of hacking in the db that I'm > certainly not capable of pulling off... About returning random rows... I've successfully applied a scrolling cursor for that. You need to scroll to the last row to find the size of the result set, but after that it's pretty easy to return random rows by scrolling to them (and marking them 'read' in some way to prevent accidentally returning the same row again). It does require some specific application code though - doing it server side would mean to pass the query as a function argument (which still requires unnatural SQL statements in your application code) or write a function for each query (*cough*). Performance was quite adequate (a few 100 ms) for a query returning random 5 rows from 3 joined tables or more, some of which had a few 100k rows. Calculating random() for each record in the result set (to sort on) was taking much longer. That was on a dual 64-bit opteron with 4GB RAM, iirc. Of course a built-in statement would be preferable, I just felt like pointing out that order by random() isn't necessarily the best alternative Regards, Alban Hertroys. !DSPAM:737,478cb43e9496078213597! ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster |
| |||
| On Tue, 15 Jan 2008 14:43:35 +0100 Alban Hertroys <dalroi@solfertje.student.utwente.nl> wrote: > You need to scroll to the last row to find the size of the result > set, but after that it's pretty easy to return random rows by > scrolling to them (and marking them 'read' in some way to prevent > accidentally returning the same row again). Could you post a snippet of code or something giving a more detailed idea of it? BTW since cursors support offset if you're not interested if the order of the retrieved rows is random too you don't even have to remember which one you read I think. -- Ivan Sergio Borgonovo http://www.webthatworks.it ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend |
| ||||
| On Jan 15, 2008, at 3:03 PM, Ivan Sergio Borgonovo wrote: > On Tue, 15 Jan 2008 14:43:35 +0100 > Alban Hertroys <dalroi@solfertje.student.utwente.nl> wrote: > >> You need to scroll to the last row to find the size of the result >> set, but after that it's pretty easy to return random rows by >> scrolling to them (and marking them 'read' in some way to prevent >> accidentally returning the same row again). > > Could you post a snippet of code or something giving a more detailed > idea of it? > > BTW since cursors support offset if you're not interested if the > order of the retrieved rows is random too you don't even have to > remember which one you read I think. I posted it on this list a while ago when I came up with this solution. I had some trouble finding my old post in the pgsql-general archives though - I could find the thread, just not my final posting, and searching didn't even turn up the thread. I did find it here: http://www.mail-archive.com/pgsql- general@postgresql.org/msg103670.html The thread contains several other approaches to the problem, it really depends on your problem domain which one fits your bill. I think the function in my original posting could do with clearer comments though, so here's the function again: /* * Return $limit random rows from the result set of SQL query $query */ function randomSet( $query, // The query to execute $limit // The (max) number of random rows required ) { // SQL to declare the cursor query("DECLARE _cur SCROLL CURSOR WITHOUT HOLD FOR $query"); /* Get the range for random(1, n) * * Determined by scrolling the cursor to the last row. * Equivalent to select count(*), but without a separate query. */ query("MOVE FORWARD ALL IN _cur"); $count = pg_affected_rows(); $uniques = array(); // A list of used cursor offsets $resultSet = array(); // Fetch random rows until we have enough or there are no more while ($limit > 0 && count($uniques) < $count) { // Determine random scroll offset $idx = random(1, $count); // Skip records with an index we already used if (in_array($idx, $uniques)) continue; //Fetch the random row $record = query("FETCH ABSOLUTE $idx FROM _cur"); // Add the row offset to the list of used offsets $uniques[] = $idx; $resultSet[] = $record; $limit--; } // query query("CLOSE _cur"); return $resultSet; } Alban Hertroys -- If you can't see the forest for the trees, cut the trees and you'll see there is no forest. !DSPAM:737,478f32e59497683469944! ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster |