Unix Technical Forum

Re: [HACKERS] O_DIRECT for WAL writes

This is a discussion on Re: [HACKERS] O_DIRECT for WAL writes within the Pgsql Patches forums, part of the PostgreSQL category; --> I have modified and attached your patch for your review. I didn't see any value to adding new fsync_method ...


Go Back   Unix Technical Forum > Database Server Software > PostgreSQL > Pgsql Patches

FAQ Members List Calendar Search Today's Posts Mark Forums Read
  #1 (permalink)  
Old 04-18-2008, 12:43 AM
Bruce Momjian
 
Posts: n/a
Default Re: [HACKERS] O_DIRECT for WAL writes


I have modified and attached your patch for your review. I didn't see
any value to adding new fsync_method values because, to me, O_DIRECT is
basically just like O_SYNC except it doesn't keep a copy of the buffer
in the kernel cache. If you are doing fsync(), I don't see how O_DIRECT
makes any sense because O_DIRECT is writing to disk on every write, and
then what is the fsync() actually doing. This might explain why your
fsync/direct and open/direct performance numbers are almost identical.
Basically, if you are going to use O_DIRECT, why not use open_sync.

What I did was to add O_DIRECT unconditionally for all uses of O_SYNC
and O_DSYNC, so it is automatically used in those cases. And of course,
if your operating system doens't support O_DIRECT, it isn't used.

With your posted performance numbers, perhaps we should favor
fsync_method O_SYNC on platforms that have O_DIRECT even if we don't
support OPEN_DATASYNC, but I bet most platforms that have O_DIRECT also
have O_DATASYNC. Perhaps some folks can run testes once the patch is
applied.

---------------------------------------------------------------------------

ITAGAKI Takahiro wrote:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > Yeah, this is about what I was afraid of: if you're actually fsyncing
> > then you get at best one commit per disk revolution, and the negotiation
> > with the OS is down in the noise.

>
> If we disable writeback-cache and use open_sync, the per-page writing
> behavior in WAL module will show up as bad result. O_DIRECT is similar
> to O_DSYNC (at least on linux), so that the benefit of it will disappear
> behind the slow disk revolution.
>
> In the current source, WAL is written as:
> for (i = 0; i < N; i++) { write(&buffers[i], BLCKSZ); }
> Is this intentional? Can we rewrite it as follows?
> write(&buffers[0], N * BLCKSZ);
>
> In order to achieve it, I wrote a 'gather-write' patch (xlog.gw.diff).
> Aside from this, I'll also send the fixed direct io patch (xlog.dio.diff).
> These two patches are independent, so they can be applied either or both.
>
>
> I tested them on my machine and the results as follows. It shows that
> direct-io and gather-write is the best choice when writeback-cache is off.
> Are these two patches worth trying if they are used together?
>
>
> | writeback | fsync= | fdata | open_ | fsync_ | open_
> patch | cache | false | sync | sync | direct | direct
> ------------+-----------+--------+-------+-------+--------+---------
> direct io | off | 124.2 | 105.7 | 48.3 | 48.3 | 48.2
> direct io | on | 129.1 | 112.3 | 114.1 | 142.9 | 144.5
> gather-write| off | 124.3 | 108.7 | 105.4 | (N/A) | (N/A)
> both | off | 131.5 | 115.5 | 114.4 | 145.4 | 145.2
>
> - 20runs * pgbench -s 100 -c 50 -t 200
> - with tuning (wal_buffers=64, commit_delay=500, checkpoint_segments=8)
> - using 2 ATA disks:
> - hda(reiserfs) includes system and wal.
> - hdc(jfs) includes database files. writeback-cache is always on.
>
> ---
> ITAGAKI Takahiro
> NTT Cyber Space Laboratories
>


[ Attachment, skipping... ]

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/docs/faq


--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073

Index: src/backend/access/transam/xlog.c
================================================== =================
RCS file: /cvsroot/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.210
diff -c -c -r1.210 xlog.c
*** src/backend/access/transam/xlog.c 23 Jul 2005 15:31:16 -0000 1.210
--- src/backend/access/transam/xlog.c 23 Jul 2005 16:09:12 -0000
***************
*** 48,77 ****


/*
* This chunk of hackery attempts to determine which file sync methods
* are available on the current platform, and to choose an appropriate
* default method. We assume that fsync() is always available, and that
* configure determined whether fdatasync() is.
*/
#if defined(O_SYNC)
! #define OPEN_SYNC_FLAG O_SYNC
#else
#if defined(O_FSYNC)
! #define OPEN_SYNC_FLAG O_FSYNC
#endif
#endif

#if defined(O_DSYNC)
#if defined(OPEN_SYNC_FLAG)
! #if O_DSYNC != OPEN_SYNC_FLAG
! #define OPEN_DATASYNC_FLAG O_DSYNC
#endif
#else /* !defined(OPEN_SYNC_FLAG) */
/* Win32 only has O_DSYNC */
! #define OPEN_DATASYNC_FLAG O_DSYNC
#endif
#endif

#if defined(OPEN_DATASYNC_FLAG)
#define DEFAULT_SYNC_METHOD_STR "open_datasync"
#define DEFAULT_SYNC_METHOD SYNC_METHOD_OPEN
--- 48,114 ----


/*
+ * Becauase O_DIRECT bypasses the kernel buffers, and because we never
+ * read those buffers except during crash recovery, it seems like
+ * a win to use it in all cases where we sync on each write().
+ */
+ #ifdef O_DIRECT
+ #define PG_O_DIRECT O_DIRECT
+ #else
+ #define PG_O_DIRECT 0
+ #endif
+
+ /*
* This chunk of hackery attempts to determine which file sync methods
* are available on the current platform, and to choose an appropriate
* default method. We assume that fsync() is always available, and that
* configure determined whether fdatasync() is.
*/
#if defined(O_SYNC)
! #define CMP_OPEN_SYNC_FLAG O_SYNC
#else
#if defined(O_FSYNC)
! #define CMP_OPEN_SYNC_FLAG O_FSYNC
#endif
#endif
+ #define OPEN_SYNC_FLAG (CMP_OPEN_SYNC_FLAG | PG_O_DIRECT)

#if defined(O_DSYNC)
#if defined(OPEN_SYNC_FLAG)
! #if O_DSYNC != CMP_OPEN_SYNC_FLAG
! #define OPEN_DATASYNC_FLAG (O_DSYNC | PG_O_DIRECT)
#endif
#else /* !defined(OPEN_SYNC_FLAG) */
/* Win32 only has O_DSYNC */
! #define OPEN_DATASYNC_FLAG (O_DSYNC | PG_O_DIRECT)
#endif
#endif

+ /*
+ * Limitation of buffer-alignment for direct io depend on OS and filesystem,
+ * but BLCKSZ is assumed to be enough for it.
+ */
+ #ifdef O_DIRECT
+ #define ALIGNOF_XLOG_BUFFER BLCKSZ
+ #else
+ #define ALIGNOF_XLOG_BUFFER MAXIMUM_ALIGNOF
+ #endif
+
+ /*
+ * Switch the alignment routine because ShmemAlloc() returns a max-aligned
+ * buffer and ALIGNOF_XLOG_BUFFER may be greater than MAXIMUM_ALIGNOF.
+ */
+ #if ALIGNOF_XLOG_BUFFER <= MAXIMUM_ALIGNOF
+ #define XLOG_BUFFER_ALIGN(LEN) MAXALIGN((LEN))
+ #else
+ #define XLOG_BUFFER_ALIGN(LEN) ((LEN) + (ALIGNOF_XLOG_BUFFER))
+ #endif
+ /* assume sizeof(ptrdiff_t) == sizeof(void*) */
+ #define POINTERALIGN(ALIGNVAL,PTR) \
+ ((char *)(((ptrdiff_t) (PTR) + (ALIGNVAL-1)) & ~((ptrdiff_t) (ALIGNVAL-1))))
+ #define XLOG_BUFFER_POINTERALIGN(PTR) \
+ POINTERALIGN((ALIGNOF_XLOG_BUFFER), (PTR))
+
#if defined(OPEN_DATASYNC_FLAG)
#define DEFAULT_SYNC_METHOD_STR "open_datasync"
#define DEFAULT_SYNC_METHOD SYNC_METHOD_OPEN
***************
*** 469,474 ****
--- 506,522 ----
static char *str_time(time_t tnow);
static void issue_xlog_fsync(void);

+ /* XLog gather-write staffs */
+ typedef struct XLogPages
+ {
+ char *head; /* Head of first page */
+ int size; /* Total bytes of pages == count(pages) * BLCKSZ */
+ int offset; /* Offset in xlog segment file */
+ } XLogPages;
+ static void XLogPageReset(XLogPages *pages);
+ static void XLogPageWrite(XLogPages *pages, int index);
+ static void XLogPageFlush(XLogPages *pages, int index);
+
#ifdef WAL_DEBUG
static void xlog_outrec(char *buf, XLogRecord *record);
#endif
***************
*** 1245,1253 ****
XLogWrite(XLogwrtRqst WriteRqst)
{
XLogCtlWrite *Write = &XLogCtl->Write;
- char *from;
bool ispartialpage;
bool use_existent;

/* We should always be inside a critical section here */
Assert(CritSectionCount > 0);
--- 1293,1302 ----
XLogWrite(XLogwrtRqst WriteRqst)
{
XLogCtlWrite *Write = &XLogCtl->Write;
bool ispartialpage;
bool use_existent;
+ int currentIndex = Write->curridx;
+ XLogPages pages;

/* We should always be inside a critical section here */
Assert(CritSectionCount > 0);
***************
*** 1258,1263 ****
--- 1307,1314 ----
*/
LogwrtResult = Write->LogwrtResult;

+ XLogPageReset(&pages);
+
while (XLByteLT(LogwrtResult.Write, WriteRqst.Write))
{
/*
***************
*** 1266,1279 ****
* end of the last page that's been initialized by
* AdvanceXLInsertBuffer.
*/
! if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[Write->curridx]))
elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
! XLogCtl->xlblocks[Write->curridx].xlogid,
! XLogCtl->xlblocks[Write->curridx].xrecoff);

/* Advance LogwrtResult.Write to end of current buffer page */
! LogwrtResult.Write = XLogCtl->xlblocks[Write->curridx];
ispartialpage = XLByteLT(WriteRqst.Write, LogwrtResult.Write);

if (!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
--- 1317,1330 ----
* end of the last page that's been initialized by
* AdvanceXLInsertBuffer.
*/
! if (!XLByteLT(LogwrtResult.Write, XLogCtl->xlblocks[currentIndex]))
elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
LogwrtResult.Write.xlogid, LogwrtResult.Write.xrecoff,
! XLogCtl->xlblocks[currentIndex].xlogid,
! XLogCtl->xlblocks[currentIndex].xrecoff);

/* Advance LogwrtResult.Write to end of current buffer page */
! LogwrtResult.Write = XLogCtl->xlblocks[currentIndex];
ispartialpage = XLByteLT(WriteRqst.Write, LogwrtResult.Write);

if (!XLByteInPrevSeg(LogwrtResult.Write, openLogId, openLogSeg))
***************
*** 1281,1286 ****
--- 1332,1338 ----
/*
* Switch to new logfile segment.
*/
+ XLogPageFlush(&pages, currentIndex);
if (openLogFile >= 0)
{
if (close(openLogFile))
***************
*** 1354,1384 ****
openLogOff = 0;
}

! /* Need to seek in the file? */
! if (openLogOff != (LogwrtResult.Write.xrecoff - BLCKSZ) % XLogSegSize)
! {
! openLogOff = (LogwrtResult.Write.xrecoff - BLCKSZ) % XLogSegSize;
! if (lseek(openLogFile, (off_t) openLogOff, SEEK_SET) < 0)
! ereport(PANIC,
! (errcode_for_file_access(),
! errmsg("could not seek in log file %u, segment %u to offset %u: %m",
! openLogId, openLogSeg, openLogOff)));
! }
!
! /* OK to write the page */
! from = XLogCtl->pages + Write->curridx * BLCKSZ;
! errno = 0;
! if (write(openLogFile, from, BLCKSZ) != BLCKSZ)
! {
! /* if write didn't set errno, assume problem is no disk space */
! if (errno == 0)
! errno = ENOSPC;
! ereport(PANIC,
! (errcode_for_file_access(),
! errmsg("could not write to log file %u, segment %u at offset %u: %m",
! openLogId, openLogSeg, openLogOff)));
! }
! openLogOff += BLCKSZ;

/*
* If we just wrote the whole last page of a logfile segment,
--- 1406,1413 ----
openLogOff = 0;
}

! /* Add a page to buffer */
! XLogPageWrite(&pages, currentIndex);

/*
* If we just wrote the whole last page of a logfile segment,
***************
*** 1390,1397 ****
* This is also the right place to notify the Archiver that the
* segment is ready to copy to archival storage.
*/
! if (openLogOff >= XLogSegSize && !ispartialpage)
{
issue_xlog_fsync();
LogwrtResult.Flush = LogwrtResult.Write; /* end of current page */

--- 1419,1427 ----
* This is also the right place to notify the Archiver that the
* segment is ready to copy to archival storage.
*/
! if (openLogOff + pages.size >= XLogSegSize && !ispartialpage)
{
+ XLogPageFlush(&pages, currentIndex);
issue_xlog_fsync();
LogwrtResult.Flush = LogwrtResult.Write; /* end of current page */

***************
*** 1405,1412 ****
LogwrtResult.Write = WriteRqst.Write;
break;
}
! Write->curridx = NextBufIdx(Write->curridx);
}

/*
* If asked to flush, do so
--- 1435,1443 ----
LogwrtResult.Write = WriteRqst.Write;
break;
}
! currentIndex = NextBufIdx(currentIndex);
}
+ XLogPageFlush(&pages, currentIndex);

/*
* If asked to flush, do so
***************
*** 3584,3590 ****
if (XLOGbuffers < MinXLOGbuffers)
XLOGbuffers = MinXLOGbuffers;

! return MAXALIGN(sizeof(XLogCtlData) + sizeof(XLogRecPtr) * XLOGbuffers)
+ BLCKSZ * XLOGbuffers +
MAXALIGN(sizeof(ControlFileData));
}
--- 3615,3621 ----
if (XLOGbuffers < MinXLOGbuffers)
XLOGbuffers = MinXLOGbuffers;

! return XLOG_BUFFER_ALIGN(sizeof(XLogCtlData) + sizeof(XLogRecPtr) * XLOGbuffers)
+ BLCKSZ * XLOGbuffers +
MAXALIGN(sizeof(ControlFileData));
}
***************
*** 3601,3607 ****

XLogCtl = (XLogCtlData *)
ShmemInitStruct("XLOG Ctl",
! MAXALIGN(sizeof(XLogCtlData) +
sizeof(XLogRecPtr) * XLOGbuffers)
+ BLCKSZ * XLOGbuffers,
&foundXLog);
--- 3632,3638 ----

XLogCtl = (XLogCtlData *)
ShmemInitStruct("XLOG Ctl",
! XLOG_BUFFER_ALIGN(sizeof(XLogCtlData) +
sizeof(XLogRecPtr) * XLOGbuffers)
+ BLCKSZ * XLOGbuffers,
&foundXLog);
***************
*** 3630,3638 ****
* Here, on the other hand, we must MAXALIGN to ensure the page
* buffers have worst-case alignment.
*/
! XLogCtl->pages =
! ((char *) XLogCtl) + MAXALIGN(sizeof(XLogCtlData) +
! sizeof(XLogRecPtr) * XLOGbuffers);
memset(XLogCtl->pages, 0, BLCKSZ * XLOGbuffers);

/*
--- 3661,3669 ----
* Here, on the other hand, we must MAXALIGN to ensure the page
* buffers have worst-case alignment.
*/
! XLogCtl->pages = XLOG_BUFFER_POINTERALIGN(
! ((char *) XLogCtl)
! + sizeof(XLogCtlData) + sizeof(XLogRecPtr) * XLOGbuffers);
memset(XLogCtl->pages, 0, BLCKSZ * XLOGbuffers);

/*
***************
*** 3662,3667 ****
--- 3693,3699 ----
{
CheckPoint checkPoint;
char *buffer;
+ char buffer0[BLCKSZ + ALIGNOF_XLOG_BUFFER];
XLogPageHeader page;
XLogLongPageHeader longpage;
XLogRecord *record;
***************
*** 3690,3697 ****
/* First timeline ID is always 1 */
ThisTimeLineID = 1;

! /* Use malloc() to ensure buffer is MAXALIGNED */
! buffer = (char *) malloc(BLCKSZ);
page = (XLogPageHeader) buffer;
memset(buffer, 0, BLCKSZ);

--- 3722,3728 ----
/* First timeline ID is always 1 */
ThisTimeLineID = 1;

! buffer = XLOG_BUFFER_POINTERALIGN(buffer0);
page = (XLogPageHeader) buffer;
memset(buffer, 0, BLCKSZ);

***************
*** 5837,5839 ****
--- 5868,5938 ----
errmsg("could not remove file \"%s\": %m",
BACKUP_LABEL_FILE)));
}
+
+
+ /* XLog gather-write staffs */
+
+ static void
+ XLogPageReset(XLogPages *pages)
+ {
+ memset(pages, 0, sizeof(*pages));
+ }
+
+ static void
+ XLogPageWrite(XLogPages *pages, int index)
+ {
+ char *page = XLogCtl->pages + index * BLCKSZ;
+ int size = BLCKSZ;
+ int offset = (LogwrtResult.Write.xrecoff - BLCKSZ) % XLogSegSize;
+
+ if (pages->head + pages->size == page
+ && pages->offset + pages->size == offset)
+ { /* Pages are continuous. Append new page. */
+ pages->size += size;
+ }
+ else
+ { /* Pages are not continuous. Flush and clear. */
+ XLogPageFlush(pages, PrevBufIdx(index));
+ pages->head = page;
+ pages->size = size;
+ pages->offset = offset;
+ }
+ }
+
+ static void
+ XLogPageFlush(XLogPages *pages, int index)
+ {
+ if (!pages->head)
+ { /* No needs to write pages. */
+ XLogCtl->Write.curridx = index;
+ return;
+ }
+
+ /* Need to seek in the file? */
+ if (openLogOff != pages->offset)
+ {
+ openLogOff = pages->offset;
+ if (lseek(openLogFile, (off_t) openLogOff, SEEK_SET) < 0)
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not seek in log file %u, segment %u to offset %u: %m",
+ openLogId, openLogSeg, openLogOff)));
+ }
+
+ /* OK to write the page */
+ errno = 0;
+ if (write(openLogFile, pages->head, pages->size) != pages->size)
+ {
+ /* if write didn't set errno, assume problem is no disk space */
+ if (errno == 0)
+ errno = ENOSPC;
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not write to log file %u, segment %u at offset %u: %m",
+ openLogId, openLogSeg, openLogOff)));
+ }
+
+ openLogOff += pages->size;
+ XLogCtl->Write.curridx = index;
+ XLogPageReset(pages);
+ }


---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #2 (permalink)  
Old 04-18-2008, 12:48 AM
Mark Wong
 
Posts: n/a
Default Re: [HACKERS] O_DIRECT for WAL writes

Ok, I finally got a couple of tests done against CVS from Aug 3, 2005.
I'm not sure if I'm showing anything insightful though. I've learned
that fdatasync and O_DSYNC are simply fsync and O_SYNC respectively on
Linux, which you guys may have already known. There appears to be a
fair performance decrease in using open_sync. Just to double check, am
I correct in understanding only open_sync uses O_DIRECT?

fdatasync
http://www.testing.osdl.org/projects...s/dev4-015/38/
5462 notpm

open_sync
http://www.testing.osdl.org/projects...s/dev4-015/40/
4860 notpm

Mark

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

http://archives.postgresql.org

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
  #3 (permalink)  
Old 04-18-2008, 12:48 AM
Bruce Momjian
 
Posts: n/a
Default Re: [HACKERS] O_DIRECT for WAL writes

Mark Wong wrote:
> Ok, I finally got a couple of tests done against CVS from Aug 3, 2005.
> I'm not sure if I'm showing anything insightful though. I've learned
> that fdatasync and O_DSYNC are simply fsync and O_SYNC respectively on
> Linux, which you guys may have already known. There appears to be a


That is not what we thought for Linux, but many other OS's behave that
way.

> fair performance decrease in using open_sync. Just to double check, am
> I correct in understanding only open_sync uses O_DIRECT?


Right.

> fdatasync
> http://www.testing.osdl.org/projects...s/dev4-015/38/
> 5462 notpm
>
> open_sync
> http://www.testing.osdl.org/projects...s/dev4-015/40/
> 4860 notpm


Right now open_sync is our last choice, which seems to still be valid
for Linux, at least.

--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Digg this Post!Add Post to del.icio.usBookmark Post in TechnoratiFurl this Post!
Reply With Quote
Reply


Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On
Forum Jump


All times are GMT. The time now is 07:24 PM.


Powered by vBulletin® Version 3.6.5
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Search Engine Optimization by vBSEO 3.2.0
www.UnixAdminTalk.com