Unix Technical Forum

Possible explanations for catastrophic performace deterioration?

This is a discussion on Possible explanations for catastrophic performace deterioration? within the Pgsql Performance forums, part of the PostgreSQL category; --> I recently had a puzzling experience (performace related). Had a DB running presumably smoothly, on a server with Dual-Core ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql Performance

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-19-2008, 11:36 AM
Carlos Moreno
 
Posts: n/a
Default Possible explanations for catastrophic performace deterioration?


I recently had a puzzling experience (performace related).

Had a DB running presumably smoothly, on a server with Dual-Core
Opteron and 4GB of RAM (and SATA2 drives with Hardware RAID-1).
(PostgreSQL 8.2.4 installed from source, on a FC4 system --- databases
with no encoding --- initdb -E SQL_ASCII --no-locale, and all the
databases created with encoding SQL_ASCII)

We thought that performance had gone little by little down, but the
evidence now suggests that something must have triggered a big step
down in the performance of the server.

Thinking that it was simply a bottleneck with the hardware, we moved
to a different machine (lower performance CPU-wise, but with dual hard
disk, so I configured the pg_xlog directory on a partition on a separate
hard disk, estimating that this would take precedence over the lower CPU
power and the 2GB of RAM instead of 4).

Not only the performance was faster --- a query like:

select count(*) from customer

was *instantaneous* on the new machine (just after populating it,
without having even analyzed it!), and would take over a minute on
the old machine (the first time). Then, the second time, it would
take a little over two seconds on the old machine (at this point, both
machines had *zero* activity --- they were both essentially disconnected
from the outside world; serving exclusively my psql connection).

Funny thing, I dropped the database (on the old machine) and re-created
it with the backup I had just created, and now the performance on the
old one was again normal (the above query now gives me a result in
essentially no time --- same as on the new machine).

In retrospect, I'm now wondering if a vacuum full would have solved
the issue? (we do run vacuumdb -z --- vacuum analyze --- daily)

Any comments?? I'm worried that three months down the road we'll
face the same issue with this new server (that's about the time it took
since we had started running the other server until we noticed the
poor performance level) --- and we can not afford to completely stop
the system to drop-and-recreate the db on a regular basis.

Thanks,

Carlos
--


---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-19-2008, 11:36 AM
Jonah H. Harris
 
Posts: n/a
Default Re: Possible explanations for catastrophic performace deterioration?

You didn't specify the database size, but my guess is that the total
data size about enough to fit in shared_buffers or kernel cache. On
the new system (or dropped/recreated database), it would've all or
mostly fit in memory which would make things like count(*) work
quickly. On the old database, you probably had a lot of fragmentation
which would've caused significantly more I/O to be performed thereby
causing a slowdown. You could compare relation sizes to check easily.

My guess is that a vacuum full would've brought the other database
back up to speed. In the future, you probably want to set fillfactor
to a reasonable amount to account for updates-to-blocks-between-vacuum
to try and capture as few row-migrations as possible.
--
Jonah H. Harris, Sr. Software Architect | phone: 732.331.1324
EnterpriseDB Corporation | fax: 732.331.1301
499 Thornall Street, 2nd Floor | jonah.harris@enterprisedb.com
Edison, NJ 08837 | http://www.enterprisedb.com/

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-19-2008, 11:36 AM
Tom Lane
 
Posts: n/a
Default Re: Possible explanations for catastrophic performace deterioration?

"Jonah H. Harris" <jonah.harris@gmail.com> writes:
> My guess is that a vacuum full would've brought the other database
> back up to speed.


Yeah, table bloat is what it sounds like to me too.

> In the future, you probably want to set fillfactor
> to a reasonable amount to account for updates-to-blocks-between-vacuum
> to try and capture as few row-migrations as possible.


More to the point, check your FSM size and make sure vacuums are
happening often enough.

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 04-19-2008, 11:36 AM
Carlos Moreno
 
Posts: n/a
Default Re: Possible explanations for catastrophic performace deterioration?

Jonah H. Harris wrote:
> You didn't specify the database size


Oops, sorry about that one --- the full backup is a 950MB file. The
entire database
should fit in memory (and the effective_cache_size was set to 2GB for
the machine
with 4GB of memory)

> , but my guess is that the total
> data size about enough to fit in shared_buffers or kernel cache. On
> the new system (or dropped/recreated database), it would've all or
> mostly fit in memory which would make things like count(*) work
> quickly.


I don't understand this argument --- the newer system has actually less
memory
than the old one; how could it fit there and not on the old one? Plus,
how could
dropping-recreating the database on the same machine change the fact
that the
entire dataset entirely fit or not in memory??

The other part that puzzled me is that after running "select count(*)
.... " several
times (that particular table is *very* small --- just 200 thousand
records of no
more than 100 or 200 bytes each), then the entire table *should* have been
in memory ... Yet, it would still take a few seconds (notice that
there was a
*considerable* improvement from the first run of that query to the
second one
on the old server --- from more than a minute, to just above two
seconds.... But
still, on the new server, and after recreating the DB on the old one, it
runs in
*no time* the first time).

> My guess is that a vacuum full would've brought the other database
> back up to speed.


I'm furious now that it didn't occur to me the vacuum full until *after*
I had
recreated the database to see th problem disappear...

I wonder if I should then periodically run a vacuum full --- say, once a
week?
Once a month?

> In the future, you probably want to set fillfactor
> to a reasonable amount to account for updates-to-blocks-between-vacuum
> to try and capture as few row-migrations as possible.
>


Could you elaborate a bit on this? Or point me to the right places in the
documentation to help me understand the above?? (I'm 100% blank after
reading the above paragraph)

Thanks,

Carlos
--


---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 04-19-2008, 11:36 AM
Alvaro Herrera
 
Posts: n/a
Default Re: Possible explanations for catastrophic performacedeterioration?

Carlos Moreno wrote:

>> , but my guess is that the total
>> data size about enough to fit in shared_buffers or kernel cache. On
>> the new system (or dropped/recreated database), it would've all or
>> mostly fit in memory which would make things like count(*) work
>> quickly.

>
> I don't understand this argument --- the newer system has actually
> less memory than the old one; how could it fit there and not on the
> old one? Plus, how could dropping-recreating the database on the same
> machine change the fact that the entire dataset entirely fit or not in
> memory??


Because on the older server it is bloated, while on the new one it is
fresh thus no dead tuples.


> The other part that puzzled me is that after running "select count(*)
> ... " several times (that particular table is *very* small --- just
> 200 thousand records of no more than 100 or 200 bytes each), then the
> entire table *should* have been in memory ... Yet, it would still
> take a few seconds (notice that there was a *considerable*
> improvement from the first run of that query to the second one on the
> old server --- from more than a minute, to just above two seconds....
> But still, on the new server, and after recreating the DB on the old
> one, it runs in *no time* the first time).


Bloat can explain this as well.

>> My guess is that a vacuum full would've brought the other database
>> back up to speed.

>
> I'm furious now that it didn't occur to me the vacuum full until
> *after* I had recreated the database to see th problem disappear...
>
> I wonder if I should then periodically run a vacuum full --- say, once
> a week? Once a month?


Never. What you need to do is make sure your FSM settings
(fsm_max_pages in particular) are high enough, and that you VACUUM (not
full) frequently enough.


--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #6 (permalink)  
Old 04-19-2008, 11:36 AM
Carlos Moreno
 
Posts: n/a
Default Re: Possible explanations for catastrophic performace deterioration?


>> I don't understand this argument --- the newer system has actually
>> less memory than the old one; how could it fit there and not on the
>> old one? Plus, how could dropping-recreating the database on the same
>> machine change the fact that the entire dataset entirely fit or not in
>> memory??

>
> Because on the older server it is bloated, while on the new one it is
> fresh thus no dead tuples.


Wait a second --- am I correct in understanding then that the bloating
you guys are referring to occurs *in memory*??

My mind has been operating under the assumption that bloating only
occurs on disk, and never in memory --- is there where my logic is
mistaken?

>> I wonder if I should then periodically run a vacuum full --- say, once
>> a week? Once a month?

>
> Never. What you need to do is make sure your FSM settings
> (fsm_max_pages in particular) are high enough, and that you VACUUM (not
> full) frequently enough.


Noted.

Thanks!

Carlos
--


---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #7 (permalink)  
Old 04-19-2008, 11:36 AM
Jonah H. Harris
 
Posts: n/a
Default Re: Possible explanations for catastrophic performace deterioration?

On 9/23/07, Carlos Moreno <moreno_pg@mochima.com> wrote:
> Wait a second --- am I correct in understanding then that the bloating
> you guys are referring to occurs *in memory*??


No, bloating occurs on-disk; but this does affect memory. Bloat means
that even though your table data may take up 1G after the initial
load, due to poor vacuuming, table layouts, etc. it to equal something
more... say 2G.

The thing is, even though the table only stores 1G of data, it is now
physically 2G. So, anything that would need to read the entire table
(like COUNT(*)), or large sections of it sequentially, are performing
twice as many I/Os to do so. Which means you're actually waiting on
two things, I/O and additional CPU time reading blocks that have very
little viable data in them.

--
Jonah H. Harris, Sr. Software Architect | phone: 732.331.1324
EnterpriseDB Corporation | fax: 732.331.1301
499 Thornall Street, 2nd Floor | jonah.harris@enterprisedb.com
Edison, NJ 08837 | http://www.enterprisedb.com/

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #8 (permalink)  
Old 04-19-2008, 11:36 AM
Carlos Moreno
 
Posts: n/a
Default Re: Possible explanations for catastrophic performancedeterioration?

Jonah H. Harris wrote:
> On 9/23/07, Carlos Moreno <moreno_pg@mochima.com> wrote:
>> Wait a second --- am I correct in understanding then that the bloating
>> you guys are referring to occurs *in memory*??

>
> No, bloating occurs on-disk; but this does affect memory. Bloat means
> that even though your table data may take up 1G after the initial
> load, due to poor vacuuming, table layouts, etc. it to equal something
> more... say 2G.
>
> The thing is, even though the table only stores 1G of data, it is now
> physically 2G. So, anything that would need to read the entire table
> (like COUNT(*)), or large sections of it sequentially, are performing
> twice as many I/Os to do so.


OK --- that was my initial impression... But again, then I'm still puzzled
about why *the second time* that I load the query, it still take a few
seconds.

That is: the first time I run the query, it has to go through the disk;
in the normal case it would have to read 100MB of data, but due to
bloating,
it actually has to go through 2GB of data. Ok, but then, it will load
only 100MB (the ones that are not "uncollected disk garbage") to memory.
The next time that I run the query, the server would only need to read
100MB from memory --- the result should be instantaneous...

The behaviour I observed was: first time I run the query took over one
minute; second time, a little above two seconds. Tried four or five times
more; in every instance it was around 2 seconds. On the new server, *the
first time* I run the query, it takes *no time* (I repeat: *no time*
--- as
in perhaps 10 to 100 msec; in any case, my eyes could not resolve between
the moment I hit enter and the moment I see the result with the count of
rows --- that's between one and two orders of magnitude faster than with
the
old server --- and again, we're comparing *the first* time I execute the
query
on the new machine, in which case it is expected that it would have to read
from disk, compared to the second and subsequent times that I execute it on
the old machine, in which case, since the bloating does not occur in
memory,
the entire seq. scan should occur exclusively in memory ... )

That's what still puzzles me --- Alvaro's reply seemed to explain it if I
accept that the bloating affects memory (dead tuples loaded to memory
reduce
the capacity to load the entire dataset into memory)...

Someone could shed some light and point out if there's still something I'm
missing or some other mistake in my analysis?? Hope I'm not sounding like
I'm being dense!!

Thanks,

Carlos
--



---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #9 (permalink)  
Old 04-19-2008, 11:36 AM
Alvaro Herrera
 
Posts: n/a
Default Re: Possible explanations for catastrophic performancedeterioration?

Carlos Moreno wrote:

> That is: the first time I run the query, it has to go through the
> disk; in the normal case it would have to read 100MB of data, but due
> to bloating, it actually has to go through 2GB of data. Ok, but
> then, it will load only 100MB (the ones that are not "uncollected
> disk garbage") to memory. The next time that I run the query, the
> server would only need to read 100MB from memory --- the result should
> be instantaneous...


Wrong. If there is 2GB of data, 1900MB of which is dead tuples, those
pages would still have to be scanned for the count(*). The system does
not distinguish "pages which have no live tuples" from other pages, so
it has to load them all.

--
Alvaro Herrera http://www.amazon.com/gp/registry/CTMLCN8V17R4
"[PostgreSQL] is a great group; in my opinion it is THE best open source
development communities in existence anywhere." (Lamar Owen)

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #10 (permalink)  
Old 04-19-2008, 11:36 AM
Carlos Moreno
 
Posts: n/a
Default Re: Possible explanations for catastrophic performancedeterioration?

Alvaro Herrera wrote:
> Carlos Moreno wrote:
>
>
>> That is: the first time I run the query, it has to go through the
>> disk; in the normal case it would have to read 100MB of data, but due
>> to bloating, it actually has to go through 2GB of data. Ok, but
>> then, it will load only 100MB (the ones that are not "uncollected
>> disk garbage") to memory. The next time that I run the query, the
>> server would only need to read 100MB from memory --- the result should
>> be instantaneous...

>
> Wrong. If there is 2GB of data, 1900MB of which is dead tuples, those
> pages would still have to be scanned for the count(*). The system does
> not distinguish "pages which have no live tuples" from other pages, so
> it has to load them all.


Yes, that part I understand --- I think I now know what the error is in
my logic. I was thinking as follows: We read 2GB of which 1900MB are
dead tuples. But then, once they're read, the system will only keep
in memory the 100MB that are valid tuples.

I'm now thinking that the problem with my logic is that the system does
not keep anything in memory (or not all tuples, in any case), since it
is only counting, so it does not *have to* keep them, and since the
total amount of reading from the disk exceeds the amount of physical
memory, then the valid tuples are "pushed out" of memory.

So, the second time I execute the query, it will still need to scan the
disk (in my mind, the way I was seeing it, the second time I execute
the "select count(*) from customer", the entire customer table would be
in memory from the previous time, and that's why I was thinking that
the bloating would not explain why the second time it is still slow).

Am I understanding it right?

Thanks for your patience!

Carlos
--



---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 06:02 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Search Engine Optimization by vBSEO 3.2.0
www.UnixAdminTalk.com