Unix Technical Forum

Performance problem with table containing a lot of text (blog)

This is a discussion on Performance problem with table containing a lot of text (blog) within the Pgsql Performance forums, part of the PostgreSQL category; --> Hello! Some background info.. We have a blog table that contains about eight million blog entries. Average length of ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql Performance

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-19-2008, 11:26 AM
Kari Lavikka
 
Posts: n/a
Default Performance problem with table containing a lot of text (blog)

Hello!

Some background info.. We have a blog table that contains about eight
million blog entries. Average length of an entry is 1200 letters. Because
each 8k page can accommodate only a few entries, every query that involves
several entries causes several random seeks to disk. We are having
problems with queries like:

1) give me a list of months when I have written someting
2) give me id's of entries I have written on month X year X
3) give me the number of blog entries my friends have written since last
time

Clustering would probably decrease random seeks but it is not an option.
It locks the table and operation would take "some" time. It should also be
done periodically to maintain clustering.

I guess that file system cache gets filled with text contents of blog
entries although they are totally useless for queries like these. Contents
of individual blog entries are cached to memcached on application level
anyway. There's rarely any need to fetch them from database.

It would be nice if I could flag a column to be toasted always, regardless
of it's length.

Because there isn't such option maybe I should create a separate table for
blog text content. Does anybody have better ideas for this?

Thanks!


P.S. Here's a plan for query #3. Users can have several bookmark groups
they are following. User can limit visibility of an entry to some of
his/her bookmark group. Those are not any kind of bottlenecks anyway...

Sort (cost=34112.60..34117.94 rows=2138 width=14)
Sort Key: count(*), upper((u.nick)::text)
-> HashAggregate (cost=33962.28..33994.35 rows=2138 width=14)
-> Nested Loop (cost=8399.95..33946.24 rows=2138 width=14)
-> Nested Loop (cost=8399.95..9133.16 rows=90 width=22)
-> HashAggregate (cost=8399.95..8402.32 rows=237 width=8)
-> Nested Loop (cost=0.00..8395.99 rows=792 width=8)
-> Index Scan using user_bookmark_uid on user_bookmark ub (cost=0.00..541.39 rows=2368 width=12)
Index Cond: (uid = 256979)
-> Index Scan using user_bookmark_group_pkey on user_bookmark_group bg (cost=0.00..3.30 rows=1 width=4)
Index Cond: ("outer".bookmark_group_id = bg.bookmark_group_id)
Filter: (("type" >= 0) AND ("type" <= 1) AND (trace_blog = 'y'::bpchar))
-> Index Scan using users_uid_accepted_only on users u (cost=0.00..3.06 rows=1 width=14)
Index Cond: (u.uid = "outer".marked_uid)
-> Index Scan using blog_entry_uid_beid on blog_entry be (cost=0.00..275.34 rows=24 width=8)
Index Cond: ((be.uid = "outer".marked_uid) AND (COALESCE("outer".last_seen_blog_entry_id, 0) < be.blog_entry_id))
Filter: ((visibility = 'p'::bpchar) AND ((status = 'p'::bpchar) OR (status = 'l'::bpchar)) AND ((bookmark_group_id IS NULL) OR (subplan)))
SubPlan
-> Index Scan using user_bookmark_pkey on user_bookmark fub (cost=0.00..3.42 rows=1 width=0)
Index Cond: ((bookmark_group_id = $0) AND (marked_uid = 256979))

P.S. That particular user has quite many unread entries though...

|\__/|
( oo ) Kari Lavikka - tuner@bdb.fi - (050) 380 3808
__ooO( )Ooo_______ _____ ___ _ _ _ _ _ _ _
""

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-19-2008, 11:26 AM
Dan Harris
 
Posts: n/a
Default Re: Performance problem with table containing a lot oftext (blog)

Kari Lavikka wrote:
> Hello!
>
> Some background info.. We have a blog table that contains about eight
> million blog entries. Average length of an entry is 1200 letters.
> Because each 8k page can accommodate only a few entries, every query
> that involves several entries causes several random seeks to disk. We
> are having problems with queries like:
>
> 1) give me a list of months when I have written someting
> 2) give me id's of entries I have written on month X year X
> 3) give me the number of blog entries my friends have written since last
> time


I didn't see your schema, but couldn't these problems be solved by storing the
article id, owner id, and blog date in a separate table? It seems that if you
don't actually need the content of the blogs, all of those questions could be
answered by querying a very simple table with minimal I/O overhead.



---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-19-2008, 11:26 AM
Kari Lavikka
 
Posts: n/a
Default Re: Performance problem with table containing a lot oftext (blog)


> I didn't see your schema, but couldn't these problems be solved by storing
> the article id, owner id, and blog date in a separate table? It seems that
> if you don't actually need the content of the blogs, all of those questions
> could be answered by querying a very simple table with minimal I/O overhead.


Yes. I was suggesting this as an option but I'm wondering if there
are other solutions.

|\__/|
( oo ) Kari Lavikka - tuner@bdb.fi - (050) 380 3808
__ooO( )Ooo_______ _____ ___ _ _ _ _ _ _ _
""

On Tue, 28 Aug 2007, Dan Harris wrote:

> Kari Lavikka wrote:
>> Hello!
>>
>> Some background info.. We have a blog table that contains about eight
>> million blog entries. Average length of an entry is 1200 letters. Because
>> each 8k page can accommodate only a few entries, every query that involves
>> several entries causes several random seeks to disk. We are having
>> problems with queries like:
>>
>> 1) give me a list of months when I have written someting
>> 2) give me id's of entries I have written on month X year X
>> 3) give me the number of blog entries my friends have written since last
>> time

>
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/docs/faq
>


---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 04-19-2008, 11:27 AM
Heikki Linnakangas
 
Posts: n/a
Default Re: Performance problem with table containing a lot of text (blog)

Kari Lavikka wrote:
> It would be nice if I could flag a column to be toasted always,
> regardless of it's length.


The idea of being able to set the toast threshold per column was
discussed during 8.3 development, but no patch was produced IIRC. We
might do that in the future. If you're willing to compile from source,
you can lower TOAST_TUPLE_THRESHOLD.

You could also use ALTER TABLE ... ALTER COLUMN ... SET STORAGE EXTERNAL
to force the long blog entries to be stored in the toast table instead
of compressing them in the main table. Values smaller than
TOAST_TUPLE_THRESHOLD (2k by default?) still wouldn't be toasted,
though, so it might not make much difference.

> Because there isn't such option maybe I should create a separate table
> for blog text content. Does anybody have better ideas for this?


That's probably the easiest solution. You can put a view on top of them
to hide it from the application.

> P.S. Here's a plan for query #3. Users can have several bookmark groups
> they are following. User can limit visibility of an entry to some of
> his/her bookmark group. Those are not any kind of bottlenecks anyway...


If the user_bookmark table is not clustered by uid, I'm surprised the
planner didn't choose a bitmap index scan. Which version of PostgreSQL
is this?

PS. EXPLAIN ANALYZE is much more helpful than plain EXPLAIN.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 04-19-2008, 11:27 AM
Kari Lavikka
 
Posts: n/a
Default Re: Performance problem with table containing a lot oftext (blog)

On Wed, 29 Aug 2007, Heikki Linnakangas wrote:

> The idea of being able to set the toast threshold per column was
> discussed during 8.3 development, but no patch was produced IIRC. We
> might do that in the future. If you're willing to compile from source,
> you can lower TOAST_TUPLE_THRESHOLD.


We are currently using Postgres 8.1 but have to upgrade to 8.2 shortly.
New version fixes some vacuum problems.

I always compile postgres from source. Maybe I have to do some
calculations because that setting affects all tables and databases. Most
of our text/varchar columns are quite short but setting the threshold too
low causes excessive seeks to toast tables... right?

>> Because there isn't such option maybe I should create a separate table
>> for blog text content. Does anybody have better ideas for this?

>
> That's probably the easiest solution. You can put a view on top of them
> to hide it from the application.


Yeh.

> If the user_bookmark table is not clustered by uid, I'm surprised the
> planner didn't choose a bitmap index scan.


Drumroll... there are:
"user_bookmark_pkey" PRIMARY KEY, btree (bookmark_group_id, marked_uid), tablespace "lun3"
"user_bookmark_marked_uid" btree (marked_uid)
"user_bookmark_uid" btree (uid) CLUSTER, tablespace "lun3"

Queries are mostly like "Gimme all of my bookmarked friends in all of my
bookmark groups" and rarely the opposite "Gimme all users who have
bookmarked me"

I have clustered the table using uid to minimize random page fetches.

- Kari

>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster
>


---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #6 (permalink)  
Old 04-19-2008, 11:27 AM
Heikki Linnakangas
 
Posts: n/a
Default Re: Performance problem with table containing a lot of text (blog)

Kari Lavikka wrote:
> On Wed, 29 Aug 2007, Heikki Linnakangas wrote:
>
>> The idea of being able to set the toast threshold per column was
>> discussed during 8.3 development, but no patch was produced IIRC. We
>> might do that in the future. If you're willing to compile from source,
>> you can lower TOAST_TUPLE_THRESHOLD.

>
> We are currently using Postgres 8.1 but have to upgrade to 8.2 shortly.
> New version fixes some vacuum problems.
>
> I always compile postgres from source. Maybe I have to do some
> calculations because that setting affects all tables and databases. Most
> of our text/varchar columns are quite short but setting the threshold
> too low causes excessive seeks to toast tables... right?


Right. If you have trouble finding the right balance, you can also use
ALTER STORAGE PLAIN to force the other columns not to be toasted.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 06:00 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Search Engine Optimization by vBSEO 3.2.0
www.UnixAdminTalk.com