Unix Technical Forum

TB-sized databases

This is a discussion on TB-sized databases within the Pgsql Performance forums, part of the PostgreSQL category; --> Hi all, I have a user who is looking to store 500+ GB of data in a database (and ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql Performance

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-19-2008, 11:42 AM
Peter Koczan
 
Posts: n/a
Default TB-sized databases

Hi all,

I have a user who is looking to store 500+ GB of data in a database
(and when all the indexes and metadata are factored in, it's going to
be more like 3-4 TB). He is wondering how well PostgreSQL scales with
TB-sized databases and what can be done to help optimize them (mostly
hardware and config parameters, maybe a little advocacy). I can't
speak on that since I don't have any DBs approaching that size.

The other part of this puzzle is that he's torn between MS SQL Server
(running on Windows and unsupported by us) and PostgreSQL (running on
Linux...which we would fully support). If any of you have ideas of how
well PostgreSQL compares to SQL Server, especially in TB-sized
databases, that would be much appreciated.

We're running PG 8.2.5, by the way.

Peter

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-19-2008, 11:42 AM
Joshua D. Drake
 
Posts: n/a
Default Re: TB-sized databases

Peter Koczan wrote:
> Hi all,
>
> I have a user who is looking to store 500+ GB of data in a database
> (and when all the indexes and metadata are factored in, it's going to
> be more like 3-4 TB). He is wondering how well PostgreSQL scales with
> TB-sized databases and what can be done to help optimize them (mostly
> hardware and config parameters, maybe a little advocacy). I can't
> speak on that since I don't have any DBs approaching that size.
>
> The other part of this puzzle is that he's torn between MS SQL Server
> (running on Windows and unsupported by us) and PostgreSQL (running on
> Linux...which we would fully support). If any of you have ideas of how
> well PostgreSQL compares to SQL Server, especially in TB-sized
> databases, that would be much appreciated.
>
> We're running PG 8.2.5, by the way.


Well I can't speak to MS SQL-Server because all of our clients run
PostgreSQL .. I can tell you we have many that are in the 500GB -
1.5TB range.

All perform admirably as long as you have the hardware behind it and are
doing correct table structuring (such as table partitioning).

Sincerely,

Joshua D. Drake


>
> Peter
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
> http://archives.postgresql.org
>



---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-19-2008, 11:42 AM
Oleg Bartunov
 
Posts: n/a
Default Re: TB-sized databases

We have several TB database in production and it works well on
HP rx1620 dual Itanium2, MSA 20, running Linux. It's read-only storage for
astronomical catalogs with about 4-billions objects. We have custom
index for spherical coordinates which provide great performance.

Oleg
On Mon, 26 Nov 2007, Peter Koczan wrote:

> Hi all,
>
> I have a user who is looking to store 500+ GB of data in a database
> (and when all the indexes and metadata are factored in, it's going to
> be more like 3-4 TB). He is wondering how well PostgreSQL scales with
> TB-sized databases and what can be done to help optimize them (mostly
> hardware and config parameters, maybe a little advocacy). I can't
> speak on that since I don't have any DBs approaching that size.
>
> The other part of this puzzle is that he's torn between MS SQL Server
> (running on Windows and unsupported by us) and PostgreSQL (running on
> Linux...which we would fully support). If any of you have ideas of how
> well PostgreSQL compares to SQL Server, especially in TB-sized
> databases, that would be much appreciated.
>
> We're running PG 8.2.5, by the way.
>
> Peter
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
> http://archives.postgresql.org
>


Regards,
Oleg
__________________________________________________ ___________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #4 (permalink)  
Old 04-19-2008, 11:42 AM
Pablo Alcaraz
 
Posts: n/a
Default Re: TB-sized databases

I had a client that tried to use Ms Sql Server to run a 500Gb+ database.
The database simply colapsed. They switched to Teradata and it is
running good. This database has now 1.5Tb+.

Currently I have clients using postgresql huge databases and they are
happy. In one client's database the biggest table has 237Gb+ (only 1
table!) and postgresql run the database without problem using
partitioning, triggers and rules (using postgresql 8.2.5).

Pablo

Peter Koczan wrote:
> Hi all,
>
> I have a user who is looking to store 500+ GB of data in a database
> (and when all the indexes and metadata are factored in, it's going to
> be more like 3-4 TB). He is wondering how well PostgreSQL scales with
> TB-sized databases and what can be done to help optimize them (mostly
> hardware and config parameters, maybe a little advocacy). I can't
> speak on that since I don't have any DBs approaching that size.
>
> The other part of this puzzle is that he's torn between MS SQL Server
> (running on Windows and unsupported by us) and PostgreSQL (running on
> Linux...which we would fully support). If any of you have ideas of how
> well PostgreSQL compares to SQL Server, especially in TB-sized
> databases, that would be much appreciated.
>
> We're running PG 8.2.5, by the way.
>
> Peter
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
> http://archives.postgresql.org
>
>



---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #5 (permalink)  
Old 04-19-2008, 11:42 AM
Stephen Cook
 
Posts: n/a
Default Re: TB-sized databases

I think either would work; both PostgreSQL and MS SQL Server have
success stories out there running VLDBs. It really depends on what you
know and what you have. If you have a lot of experience with Postgres
running on Linux, and not much with SQL Server on Windows, of course the
former would be a better choice for you. You stand a much better chance
working with tools you know.


Pablo Alcaraz wrote:
> I had a client that tried to use Ms Sql Server to run a 500Gb+ database.
> The database simply colapsed. They switched to Teradata and it is
> running good. This database has now 1.5Tb+.
>
> Currently I have clients using postgresql huge databases and they are
> happy. In one client's database the biggest table has 237Gb+ (only 1
> table!) and postgresql run the database without problem using
> partitioning, triggers and rules (using postgresql 8.2.5).
>
> Pablo
>
> Peter Koczan wrote:
>> Hi all,
>>
>> I have a user who is looking to store 500+ GB of data in a database
>> (and when all the indexes and metadata are factored in, it's going to
>> be more like 3-4 TB). He is wondering how well PostgreSQL scales with
>> TB-sized databases and what can be done to help optimize them (mostly
>> hardware and config parameters, maybe a little advocacy). I can't
>> speak on that since I don't have any DBs approaching that size.
>>
>> The other part of this puzzle is that he's torn between MS SQL Server
>> (running on Windows and unsupported by us) and PostgreSQL (running on
>> Linux...which we would fully support). If any of you have ideas of how
>> well PostgreSQL compares to SQL Server, especially in TB-sized
>> databases, that would be much appreciated.
>>
>> We're running PG 8.2.5, by the way.
>>
>> Peter
>>
>> ---------------------------(end of broadcast)---------------------------
>> TIP 4: Have you searched our list archives?
>>
>> http://archives.postgresql.org
>>
>>

>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings
>


---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #6 (permalink)  
Old 04-19-2008, 11:42 AM
Peter Koczan
 
Posts: n/a
Default Re: TB-sized databases

Thanks all. This is just what I needed.

On Nov 26, 2007 1:16 PM, Stephen Cook <sclists@gmail.com> wrote:
> I think either would work; both PostgreSQL and MS SQL Server have
> success stories out there running VLDBs. It really depends on what you
> know and what you have. If you have a lot of experience with Postgres
> running on Linux, and not much with SQL Server on Windows, of course the
> former would be a better choice for you. You stand a much better chance
> working with tools you know.
>
>
>
> Pablo Alcaraz wrote:
> > I had a client that tried to use Ms Sql Server to run a 500Gb+ database.
> > The database simply colapsed. They switched to Teradata and it is
> > running good. This database has now 1.5Tb+.
> >
> > Currently I have clients using postgresql huge databases and they are
> > happy. In one client's database the biggest table has 237Gb+ (only 1
> > table!) and postgresql run the database without problem using
> > partitioning, triggers and rules (using postgresql 8.2.5).
> >
> > Pablo
> >
> > Peter Koczan wrote:
> >> Hi all,
> >>
> >> I have a user who is looking to store 500+ GB of data in a database
> >> (and when all the indexes and metadata are factored in, it's going to
> >> be more like 3-4 TB). He is wondering how well PostgreSQL scales with
> >> TB-sized databases and what can be done to help optimize them (mostly
> >> hardware and config parameters, maybe a little advocacy). I can't
> >> speak on that since I don't have any DBs approaching that size.
> >>
> >> The other part of this puzzle is that he's torn between MS SQL Server
> >> (running on Windows and unsupported by us) and PostgreSQL (running on
> >> Linux...which we would fully support). If any of you have ideas of how
> >> well PostgreSQL compares to SQL Server, especially in TB-sized
> >> databases, that would be much appreciated.
> >>
> >> We're running PG 8.2.5, by the way.
> >>
> >> Peter
> >>
> >> ---------------------------(end of broadcast)---------------------------
> >> TIP 4: Have you searched our list archives?
> >>
> >> http://archives.postgresql.org
> >>
> >>

> >
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 5: don't forget to increase your free space map settings
> >

>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings
>


---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #7 (permalink)  
Old 04-19-2008, 11:42 AM
Simon Riggs
 
Posts: n/a
Default Re: TB-sized databases

On Tue, 2007-11-27 at 14:18 -0600, Peter Koczan wrote:

> Thanks all. This is just what I needed.


All of those responses have cooked up quite a few topics into one. Large
databases might mean text warehouses, XML message stores, relational
archives and fact-based business data warehouses.

The main thing is that TB-sized databases are performance critical. So
it all depends upon your workload really as to how well PostgreSQL, or
another other RDBMS vendor can handle them.


Anyway, my reason for replying to this thread is that I'm planning
changes for PostgreSQL 8.4+ that will make allow us to get bigger and
faster databases. If anybody has specific concerns then I'd like to hear
them so I can consider those things in the planning stages.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #8 (permalink)  
Old 04-19-2008, 11:42 AM
Pablo Alcaraz
 
Posts: n/a
Default Re: TB-sized databases

Simon Riggs wrote:
> All of those responses have cooked up quite a few topics into one. Large
> databases might mean text warehouses, XML message stores, relational
> archives and fact-based business data warehouses.
>
> The main thing is that TB-sized databases are performance critical. So
> it all depends upon your workload really as to how well PostgreSQL, or
> another other RDBMS vendor can handle them.
>
>
> Anyway, my reason for replying to this thread is that I'm planning
> changes for PostgreSQL 8.4+ that will make allow us to get bigger and
> faster databases. If anybody has specific concerns then I'd like to hear
> them so I can consider those things in the planning stages

it would be nice to do something with selects so we can recover a rowset
on huge tables using a criteria with indexes without fall running a full
scan.

In my opinion, by definition, a huge database sooner or later will have
tables far bigger than RAM available (same for their indexes). I think
the queries need to be solved using indexes enough smart to be fast on disk.

Pablo

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #9 (permalink)  
Old 04-19-2008, 11:42 AM
Matthew
 
Posts: n/a
Default Re: TB-sized databases

On Tue, 27 Nov 2007, Pablo Alcaraz wrote:
> it would be nice to do something with selects so we can recover a rowset
> on huge tables using a criteria with indexes without fall running a full
> scan.


You mean: Be able to tell Postgres "Don't ever do a sequential scan of
this table. It's silly. I would rather the query failed than have to wait
for a sequential scan of the entire table."

Yes, that would be really useful, if you have huge tables in your
database.

Matthew

--
Trying to write a program that can't be written is... well, it can be an
enormous amount of fun! -- Computer Science Lecturer

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #10 (permalink)  
Old 04-19-2008, 11:42 AM
Alvaro Herrera
 
Posts: n/a
Default Re: TB-sized databases

Pablo Alcaraz escribió:

> In my opinion there are queries that I think they ll need to be tuned for
> "huge databases" (huge databases = a database which relevant
> tables(indexes) are (will be) far bigger that all the ram available):
>
> -- example table
> CREATE TABLE homes (
> id bigserial,
> name text,
> location text,
> bigint money_win,
> int zipcode;
> );
> CREATE INDEX money_win_idx ON homes(money_win);
> CREATE INDEX zipcode_idx ON homes(zipcode);


Your example does not work, so I created my own for your first item.

alvherre=# create table test (a int primary key);
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "test_pkey" for table "test"
CREATE TABLE
alvherre=# insert into test select * from generate_series(1, 100000);
INSERT 0 100000
alvherre=# analyze test;
ANALYZE

> SELECT max( id) from homes;
> I think the information to get the max row quickly could be found using the
> pk index. Idem min( id).


alvherre=# explain analyze select max(a) from test;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
Result (cost=0.03..0.04 rows=1 width=0) (actual time=0.054..0.057 rows=1 loops=1)
InitPlan
-> Limit (cost=0.00..0.03 rows=1 width=4) (actual time=0.041..0.043 rows=1 loops=1)
-> Index Scan Backward using test_pkey on test (cost=0.00..3148.26 rows=100000 width=4) (actual time=0.034..0.034 rows=1 loops=1)
Filter: (a IS NOT NULL)
Total runtime: 0.143 ms
(6 rows)


> SELECT max( id) from homes WHERE id > 8000000000;
> Same, but useful to find out the same thing in partitioned tables (using id
> like partition criteria). It would be nice if Postgres would not need the
> WHERE clause to realize it does not need to scan every single partition,
> but only the last. Idem min(id).


Yeah, this could be improved.

> SELECT * from homes WHERE money_win = 1300000000;
> Postgres thinks too easily to solve these kind of queries that it must to
> do a sequential scan where the table (or the index) does not fix in memory
> if the number of rows is not near 1 (example: if the query returns 5000
> rows). Same case with filters like 'WHERE money_win >= xx', 'WHERE
> money_win BETWEEN xx AND yy'. But I do not know if this behavior is because
> I did a wrong posgresql's configuration or I missed something.


There are thresholds to switch from index scan to seqscans. It depends
on the selectivity of the clauses.

> SELECT count( *) from homes;
> it would be *cute* that Postgres stores this value and only recalculate if
> it thinks the stored value is wrong (example: after an anormal shutdown).


This is not as easy as you put it for reasons that have been discussed
at length. I'll only say that there are workarounds to make counting
quick.

> SELECT zipcode, count( zipcode) as n from homes GROUP BY zipcode;
> it would be *very cute* that Postgres could store this value (or is this
> there?) on the index or wherever and it only recalculates if it thinks the
> stored value is wrong (example: after an anormal shutdown).


Same as above.


> Last but not least, it would be *excelent* that this kind of optimization
> would be posible without weird non standard sql sentences.


Right. If you can afford to sponsor development, it could make them a
reality sooner.

--
Alvaro Herrera Valdivia, Chile ICBM: S 39º 49' 18.1", W 73º 13' 56.4"
"You're _really_ hosed if the person doing the hiring doesn't understand
relational systems: you end up with a whole raft of programmers, none of
whom has had a Date with the clue stick." (Andrew Sullivan)

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 06:32 AM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Search Engine Optimization by vBSEO 3.2.0
www.UnixAdminTalk.com