#1 (**permalink**) 03-07-2008, 02:27 PM

We have a large app that's built on
Linux, Solaris, & AIX. The app uses
many threads (pthreads API).

There is a bug in the app that shows itself
only on AIX. All (seems to) work OK on
Linux & Solaris. On AIX some threads trap
at random (i.e. there is no recognizable
pattern to the traps).

The app has a signal handler that catches
all signals and restarts the offending
thread; however, the data for the transaction
that was being processed when the trap
occurred is lost.

We disabled the program's signal handler
and ran it under DBX (dbx -r). The program
hangs without doing any work.

Thread 0 starts many child threads in a loop.
It starts a child thread, then calls
pthread_cond_timedwait() to wait on notification
from the thread that it has finished its initial
setup (ODBC database connections, etc) and is ready
to accept work.

The child thread calls pthread_cond_signal()
to notify thread 0 when its initial setup
is complete. At that time thread 0 loops
and starts the next child thread - again
calling pthread_cond_timedwait() to wait
for the child thread's "I'm started"
notification.

Under DBX, thread 0 appears to hang on
the pthread_cond_signal() call - i.e.
the pthread_cond_signal() never returns.

Any ideas???
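
For reference, a minimal sketch of the startup handshake described above, assuming a mutex-protected "ready" flag sits next to the condition variable. The names startup_gate, gate_signal_ready() and gate_wait_ready() are made up for illustration; they are not taken from the app. The flag matters because a pthread_cond_signal() sent before thread 0 reaches pthread_cond_timedwait() would otherwise be lost, and because condition waits may wake spuriously.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical handshake state shared by thread 0 and one child. */
struct startup_gate {
    pthread_mutex_t lock;
    pthread_cond_t  ready_cv;
    bool            ready;   /* predicate: survives lost or spurious wakeups */
};

static void gate_init(struct startup_gate *g)
{
    int rc = pthread_mutex_init(&g->lock, NULL);
    assert(rc == 0);
    rc = pthread_cond_init(&g->ready_cv, NULL);
    assert(rc == 0);
    g->ready = false;
}

/* Child side: call once the initial setup is complete. */
static void gate_signal_ready(struct startup_gate *g)
{
    pthread_mutex_lock(&g->lock);
    g->ready = true;                    /* set the flag under the lock */
    pthread_cond_signal(&g->ready_cv);
    pthread_mutex_unlock(&g->lock);
}

/* Thread 0 side: returns 0 when the child is ready, ETIMEDOUT otherwise. */
static int gate_wait_ready(struct startup_gate *g, int timeout_sec)
{
    struct timespec deadline;
    int rc = 0;

    /* pthread_cond_timedwait() takes an absolute time, not an interval. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_sec;

    pthread_mutex_lock(&g->lock);
    while (!g->ready && rc == 0)
        rc = pthread_cond_timedwait(&g->ready_cv, &g->lock, &deadline);
    if (g->ready)
        rc = 0;                         /* the flag wins over a late timeout */
    pthread_mutex_unlock(&g->lock);
    return rc;
}

/* Stand-in for a child thread; the real setup work would go first. */
static void *child_main(void *arg)
{
    gate_signal_ready(arg);
    return NULL;
}
```

Thread 0 would call gate_init(), then pthread_create() with child_main, then gate_wait_ready() before looping to the next child. If the real code waits without such a flag, that alone could explain a hang once the debugger changes the timing.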

#2 (**permalink**) 03-09-2008, 01:33 PM

On Mar 6, 7:21 am, Larry Smith <[email protected]> wrote:
> We have a large app that's built on
> Linux, Solaris, & AIX. The app uses
> many threads (pthreads API).
>
> There is a bug in the app that shows itself
> only on AIX. All (seems to) work OK on
> Linux & Solaris. On AIX some threads trap
> at random (i.e. there is no recognizable
> pattern to the traps).
>
> The app has a signal handler that catches
> all signals and restarts the offending
> thread; however, the data for the transaction
> that was being processed when the trap
> occurred is lost.
>
> We disabled the program's signal handler
> and ran it under DBX (dbx -r). The program
> hangs without doing any work.
>
> Thread 0 starts many child threads in a loop.
> It starts a child thread, then calls
> pthread_cond_timedwait() to wait on notification
> from the thread that it has finished its initial
> setup (ODBC database connections, etc) and is ready
> to accept work.
>
> The child thread calls pthread_cond_signal()
> to notify thread 0 when its initial setup
> is complete. At that time thread 0 loops
> and starts the next child thread - again
> calling pthread_cond_timedwait() to wait
> for the child thread's "I'm started"
> notification.
>
> Under DBX, thread 0 appears to hang on
> the pthread_cond_signal() call - i.e.
> the pthread_cond_signal() never returns.
>
> Any ideas???

No idea if this is your problem, but we found that "alignment" of
pthread structures is extremely important. We were compiling programs
on AIX (5.1, & 5.3) with the "-qalign=packed" parameter to IBM's
Visual Age 'C' compiler. These 'C' programs include pthread structures
(such as pthread_mutex_t, pthread_attr_t, pthread_t, & pthread_cond_t)
within other structures. We experienced similar symptoms... some
programs would appear to "lose" signals and sometimes have what
appears to be hanging threads.

The problem was solved by moving the pthread structures "out" of other
structures, or by moving them to be the first members of the other
structures....or by surrounding the structure containing pthread
structures with a PRAGMA (#pragma options align=natural)
Of course, your mileage may vary.

-tony

#3 (**permalink**) 03-09-2008, 01:33 PM

[email protected] wrote:
> On Mar 6, 7:21 am, Larry Smith <[email protected]> wrote:
>> We have a large app that's built on
>> Linux, Solaris, & AIX. The app uses
>> many threads (pthreads API).
>>
>> There is a bug in the app that shows itself
>> only on AIX. All (seems to) work OK on
>> Linux & Solaris. On AIX some threads trap
>> at random (i.e. there is no recognizable
>> pattern to the traps).
>>
>> The app has a signal handler that catches
>> all signals and restarts the offending
>> thread; however, the data for the transaction
>> that was being processed when the trap
>> occurred is lost.
>>
>> We disabled the program's signal handler
>> and ran it under DBX (dbx -r). The program
>> hangs without doing any work.
>>
>> Thread 0 starts many child threads in a loop.
>> It starts a child thread, then calls
>> pthread_cond_timedwait() to wait on notification
>> from the thread that it has finished its initial
>> setup (ODBC database connections, etc) and is ready
>> to accept work.
>>
>> The child thread calls pthread_cond_signal()
>> to notify thread 0 when its initial setup
>> is complete. At that time thread 0 loops
>> and starts the next child thread - again
>> calling pthread_cond_timedwait() to wait
>> for the child thread's "I'm started"
>> notification.
>>
>> Under DBX, thread 0 appears to hang on
>> the pthread_cond_signal() call - i.e.
>> the pthread_cond_signal() never returns.
>>
>> Any ideas???
>
> No idea if this is your problem, but we found that "alignment" of
> pthread structures is extremely important. We were compiling programs
> on AIX (5.1, & 5.3) with the "-qalign=packed" parameter to IBM's
> Visual Age 'C' compiler. These 'C' programs include pthread structures
> (such as pthread_mutex_t, pthread_attr_t, pthread_t, & pthread_cond_t)
> within other structures. We experienced similar symptoms... some
> programs would appear to "lose" signals and sometimes have what
> appears to be hanging threads.
>
> The problem was solved by moving the pthread structures "out" of other
> structures, or by moving them to be the first members of the other
> structures....or by surrounding the structure containing pthread
> structures with a PRAGMA (#pragma options align=natural)
>
> Of course, your mileage may vary.
>
> -tony

Thanks for the info...

We do not use -qalign, so we're getting the
platform defaults.

DBX has been no help. The app normally
comes up & starts its 30+ threads in a
few seconds. Under DBX it took six
minutes to get just the first two threads
up. So, if the issue is related to thread
contention, we'll never see it under DBX
because everything has slowed down so much.
We do not have this kind of performance
penalty running under the Linux 'kdbg'
debugger. However, the app never traps
on Linux or Solaris - only on AIX.
"electric fence" does not show any buffer
overruns.

We're stumped on the AIX issues.

#4 (**permalink**) 03-09-2008, 01:33 PM

Larry Smith <[email protected]> writes:

> We have a large app that's built on
> Linux, Solaris, & AIX. The app uses
> many threads (pthreads API).

Have you checked the application with Valgrind (Linux), or Purify
(Solaris and AIX)? No point in spending time guessing/debugging,
unless you know that the application is "squeaky clean".

> There is a bug in the app that shows itself
> only on AIX. All (seems to) work OK on
> Linux & Solaris. On AIX some threads trap
> at random (i.e. there is no recognizable
> pattern to the traps).
>
> The app has a signal handler that catches
> all signals and restarts the offending
> thread;

You can't restart a thread. If one thread of the application gets
a fatal signal, there is absolutely no reasonable way to recover
from that. I assume you mean you restart the whole process.

> We disabled the program's signal handler
> and ran it under DBX (dbx -r). The program
> hangs without doing any work.

If dbx interferes so much with your application, you should be able
to let it core-dump, and debug it post-mortem. What are some of
the crash stack traces?

Cheers,
--
In order to understand recursion you must first understand recursion.
Remove /-nsp/ for email.

#5 (**permalink**) 03-09-2008, 01:33 PM

Paul Pluzhnikov wrote:
> Larry Smith <[email protected]> writes:
>
>> We have a large app that's built on
>> Linux, Solaris, & AIX. The app uses
>> many threads (pthreads API).
>
> Have you checked the application with Valgrind (Linux), or Purify
> (Solaris and AIX)? No point in spending time guessing/debugging,
> unless you know that the application is "squeaky clean".
>
>> There is a bug in the app that shows itself
>> only on AIX. All (seems to) work OK on
>> Linux & Solaris. On AIX some threads trap
>> at random (i.e. there is no recognizable
>> pattern to the traps).
>>
>> The app has a signal handler that catches
>> all signals and restarts the offending
>> thread;
>
> You can't restart a thread. If one thread of the application gets
> a fatal signal, there is absolutely no reasonable way to recover
> from that. I assume you mean you restart the whole process.
>
>> We disabled the program's signal handler
>> and ran it under DBX (dbx -r). The program
>> hangs without doing any work.
>
> If dbx interferes so much with your application, you should be able
> to let it core-dump, and debug it post-mortem. What are some of
> the crash stack traces?
>
> Cheers,

No, they "restart" the thread.

Each thread does a setjmp(), registers
the jmpbuf with a global handler, then
goes into a processing loop.
When a trap occurs (mem access, math error,
etc) the global handler looks up the appropriate
jmpbuf for the thread, then uses it to do a
longjmp - "restarting" the thread, which repeats
the setjmp() and re-enters its processing loop.
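
To make the mechanism concrete, here is a small sketch of that kind of "restart", with loud assumptions: it uses a thread-local sigjmp_buf in place of the global jmpbuf registry, sigsetjmp()/siglongjmp() in place of plain setjmp()/longjmp() so the signal mask is restored, and a raise(SIGFPE) as a simulated trap. None of the names come from the app. Jumping out of a signal handler abandons whatever the thread was doing, including any locks it held, so this shows the technique only and is not a recommendation.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>

/* Assumption: one jump buffer per thread via thread-local storage,
 * standing in for the registry kept by the "global handler". */
static _Thread_local sigjmp_buf restart_point;
static _Thread_local volatile sig_atomic_t restart_armed;

static void trap_handler(int sig)
{
    if (restart_armed)
        siglongjmp(restart_point, sig);  /* back to this thread's sigsetjmp() */
    signal(sig, SIG_DFL);                /* no restart point: take the default action */
    raise(sig);
}

/* Installs the handler for two trap signals; returns 0 on success. */
static int install_trap_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = trap_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return -1;
    return sigaction(SIGFPE, &sa, NULL);
}

/* Worker: traps once on purpose, "restarts", then finishes.
 * `arg` points to an int that counts restarts. */
static void *worker_main(void *arg)
{
    volatile int *restarts = arg;

    assert(restarts != NULL);
    if (sigsetjmp(restart_point, 1) != 0)
        (*restarts)++;                   /* arrived here via siglongjmp() */
    restart_armed = 1;

    if (*restarts == 0)
        raise(SIGFPE);                   /* simulated "math error" trap */

    restart_armed = 0;
    return NULL;
}
```

A caller would run install_trap_handler() once, then start worker_main with pthread_create(), passing the address of a counter. After the thread is joined the counter should read 1.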

The threads "restart" because much of the
data is generated by customer-written
plugins & may have issues. They've used
the setjmp/longjmp "restart" logic on
Windows for over 15 years...

This is a huge financial app (100's of DLL's,
100+ exe's) written for Windows. It's been
ported to 64 bit Linux, Solaris, & AIX by writing
Unix versions of the Windows API's used by
the app (e.g. xxThreadStart() wraps
pthread_create() on Unix, but wraps
_beginthreadex() on Windows; xxMutexCreate()
wraps CreateMutex() on Windows, but wraps
pthread_mutex_init() on Unix; etc, etc).
There are hundreds of these "wrapper" API's.
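
As an illustration only, one of these wrappers might look roughly like the sketch below. The signatures and the xx_mutex_t handle type are guesses; the real xxMutexCreate() is not shown anywhere in this thread. One detail worth checking in any such layer: a Win32 mutex can be re-acquired by the thread that owns it, while a default pthread mutex cannot, so the sketch asks for PTHREAD_MUTEX_RECURSIVE.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>

#ifdef _WIN32
#include <windows.h>
typedef HANDLE xx_mutex_t;             /* hypothetical handle type */
#else
#include <pthread.h>
typedef pthread_mutex_t *xx_mutex_t;   /* hypothetical handle type */
#endif

/* Returns a new mutex handle, or NULL on failure. */
static xx_mutex_t xxMutexCreate(void)
{
#ifdef _WIN32
    return CreateMutex(NULL, FALSE, NULL);
#else
    pthread_mutexattr_t attr;
    pthread_mutex_t *m = malloc(sizeof *m);

    if (m == NULL)
        return NULL;
    if (pthread_mutexattr_init(&attr) != 0) {
        free(m);
        return NULL;
    }
    /* Match Win32 semantics: the owner may lock again without deadlock. */
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    if (pthread_mutex_init(m, &attr) != 0) {
        free(m);
        m = NULL;
    }
    pthread_mutexattr_destroy(&attr);
    return m;
#endif
}

/* Both return 0 on success, nonzero on failure. */
static int xxMutexLock(xx_mutex_t m)
{
#ifdef _WIN32
    return WaitForSingleObject(m, INFINITE) == WAIT_OBJECT_0 ? 0 : -1;
#else
    assert(m != NULL);
    return pthread_mutex_lock(m);
#endif
}

static int xxMutexUnlock(xx_mutex_t m)
{
#ifdef _WIN32
    return ReleaseMutex(m) ? 0 : -1;
#else
    assert(m != NULL);
    return pthread_mutex_unlock(m);
#endif
}

static void xxMutexDestroy(xx_mutex_t m)
{
#ifdef _WIN32
    CloseHandle(m);
#else
    pthread_mutex_destroy(m);
    free(m);
#endif
}
```

Semantic gaps like the recursion one are easy to miss in an emulation layer and tend to show up only under load.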

Valgrind & Purify - yes.

The server side (where the issues are)
recv's blocks of BINARY data created
by Windows clients (8K up to 400K).
These blocks contain the memory images of
lists of heterogeneous "packed" 'C' struct's
created on the Windows clients. The Windows
version of the server code uses these
binary blocks "as is". The Unix version
of the server code has to pick this data apart
and rebuild the same 'C' struct's
in the server's native layout.
The server processes these "transaction
requests" by communicating with a
mainframe & other servers (which may be
running on Windows), updating the binary
data, then sending the binary data back to
the Windows client in the original
Windows binary format.

Why go to all of this trouble?
Because "rule number one" of this
"porting project" is that we can not
require code/data changes on the Windows
Client side. Doing so would break
tens of thousands of existing client
installations. Rule number two is that,
except for the "wrapper API", the Server
code is single-source (i.e. all platforms
are built from the same set of source files).

The Unix versions of the server are required
to be direct "drop in" replacements for the
Windows version of the server - requiring NO
changes to either the Windows clients or the
other servers (mainframe & Windows) with
which it interacts.

Surprisingly, this all works well on Linux
and Solaris. On AIX we experience random
traps under high load (300+ simultaneous
"requests" of 60k+ each).

#6 (**permalink**) 03-09-2008, 01:33 PM

Larry Smith wrote:
> [...]
> The threads "restart" because much of the
> data is generated by customer-written
> plugins & may have issues. They've used
> the setjmp/longjmp "restart" logic on
> Windows for over 15 years...

Larry, what you've just described is broken beyond belief.

Restarting a thread that just died due to bad data or incorrect parsing
of valid data doesn't fix the issue and is a stack-smasher's wet dream.

A thread with invalid data structures may corrupt all the other thread's
data.

> This is a huge financial app (100's of DLL's,
> 100+ exe's) written for Windows. It's been
> ported to 64 bit Linux, Solaris, & AIX by writing
> Unix versions of the Windows API's used by
> the app (e.g. xxThreadStart() wraps
> pthread_create() on Unix, but wraps
> _beginthreadex() on Windows; xxMutexCreate()
> wraps CreateMutex() on Windows, but wraps
> pthread_mutex_init() on Unix; etc, etc).
> There are hundreds of these "wrapper" API's.
>
> [...]
>
> Why go to all of this trouble?
> Because "rule number one" of this
> "porting project" is that we can not
> require code/data changes on the Windows
> Client side.
Clearly that version of the Windows client was the only one that
compiled and was able to run more than 10 seconds.

The PHBs will be too afraid to touch it.

> Doing so would break
> tens of thousands of existing client
> installations. Rule number two is that,
> except for the "wrapper API", the Server
> code is single-source (i.e. all platforms
> are built from the same set of source files).
Even linux which is a single-source has a few files that are
architecture-specific.

It will all depend on how thick and well-written the glue layer is.

> The Unix version's of the server are required
> to be direct "drop in" replacements for the
> Windows version of the server - requiring NO
> changes to either the Windows clients or the
> other servers (mainframe & Windows) with
> which it interacts.
That makes sense (unlike the thread restart).

> Surprisingly, this all works well on Linux
> and Solaris.
Timing, alignment, SMP vs. UP, and exercised paths make all the difference.

> On AIX we experience random
> traps under high load (300+ simultaneous
> "requests" of 60k+ each).
"Random" traps are generally the result of memory structures being
corrupted. Usually this is due to a critical section not being correctly
protected.

#7 (**permalink**) 03-09-2008, 01:33 PM

Jose Pina Coelho wrote:
> Larry Smith wrote:
>> [...]
>> The threads "restart" because much of the
>> data is generated by customer-written
>> plugins & may have issues. They've used
>> the setjmp/longjmp "restart" logic on
>> Windows for over 15 years...
>
> Larry, what you've just described is broken beyond belief.
>
> Restarting a thread that just died due to bad data or incorrect parsing
> of valid data doesn't fix the issue and is a stack-smasher's wet dream.
>
> A thread with invalid data structures may corrupt all the other thread's
> data.
>

The "porting team" has no control over this
design. It has been in place (in 100's of
places in the source code) since at least
1995.

>
>> This is a huge financial app (100's of DLL's,
>> 100+ exe's) written for Windows. It's been
>> ported to 64 bit Linux, Solaris, & AIX by writing
>> Unix versions of the Windows API's used by
>> the app (e.g. xxThreadStart() wraps
>> pthread_create() on Unix, but wraps
>> _beginthreadex() on Windows; xxMutexCreate()
>> wraps CreateMutex() on Windows, but wraps
>> pthread_mutex_init() on Unix; etc, etc).
>> There are hundreds of these "wrapper" API's.
>>
>> [...]
>>
>> Why go to all of this trouble?
>> Because "rule number one" of this
>> "porting project" is that we can not
>> require code/data changes on the Windows
>> Client side.
> Clearly that version of the Windows client was the only one that
> compiled and was able to run more than 10 seconds.
>

The Windows client & server (in numerous
versions) have been in production at thousands
of sites since the mid 1990's. The entire
system architecture was built around the
assumption that the "restart" approach is fine - and
on Windows it has worked for a long time.
To change the approach would require a complete
re-design/re-write of the entire system.
That was not approved...

[...]

>> Doing so would break
>> tens of thousands of existing client
>> installations. Rule number two is that,
>> except for the "wrapper API", the Server
>> code is single-source (i.e. all platforms
>> are built from the same set of source files).
> Even linux which is a single-source has a few files that are
> architecture-specific.
>
> It will all depend on how thick and well-written the glue layer is.
>

For this app, it's working quite well.
All of the OS-specific stuff is hidden
inside generic API's (e.g. xxMutexCreate()).
The sources for all lib's and executable's
call only these API's.

>> The Unix version's of the server are required
>> to be direct "drop in" replacements for the
>> Windows version of the server - requiring NO
>> changes to either the Windows clients or the
>> other servers (mainframe & Windows) with
>> which it interacts.
> That makes sense (unlike the thread restart).
>
>> Surprisingly, this all works well on Linux
>> and Solaris.
> Timing, alignment, SMP vs. UP, and exercised paths make all the difference.
>
>> On AIX we experience random
>> traps under high load (300+ simultaneous
>> "requests" of 60k+ each).
> "Random" traps are generally the result of memory structures being
> corrupted. Usually this is due to a critical section not being correctly
> protected.

Yes, I understand that - I just can't find it...

#8 (**permalink**) 03-09-2008, 01:33 PM

Larry Smith <[email protected]> writes:

>> Have you checked the application with Valgrind (Linux), or Purify
>> (Solaris and AIX)? No point in spending time guessing/debugging,
>> unless you know that the application is "squeaky clean".
....
> Valgrind & Purify - yes.

Yes what?

We've run the code under VG and Purify for large tests and they
detected no issues? (That's very hard to believe).

Or, yes we've run under VG and ignored all the bugs it found?

> No, they "restart" the thread.

Too bad. Essentially what you've described is a sure-fire recipe for
irreproducible crashes which are extremely hard to catch and debug.

Do you at least log the fact that such a "restart" has occurred? (You
should.) Do the crashes follow such "restarts"? (I expect so).

> The threads "restart" because much of the
> data is generated by customer-written
> plugins & may have issues.

This is a brain-dead design: if your customer-written plugin corrupts
heap, and a later call to free() crashes (possibly in some other
thread), while holding heap lock, and you longjmp out of free(),
what hope do you have of making any further progress?

*None whatsoever*.

> They've used the setjmp/longjmp "restart" logic on
> Windows for over 15 years...

They've either got extremely lucky, or they didn't tell you the
whole story.

Also, on Windows customer-written DLLs may be statically linked
against LIBC{MT}.LIB, in which case they will not share malloc()
with the rest of the code.

But on UNIX there is only one "global" malloc, so the issue of buggy
"plugins" will be exacerbated.

> This is a huge financial app (100's of DLL's,
> 100+ exe's) written for Windows.

So keep it on Windows, and don't touch it (for your sanity's sake).

> Surprisingly, this all works well on Linux
> and Solaris.

Since you (apparently) have little (if any) AIX-specific code, I
expect the same bug(s) that cause AIX crashes are also present in
Solaris and Linux versions, and these versions don't "work well",
you just haven't observed the problems yet.

Cheers,
--
In order to understand recursion you must first understand recursion.
Remove /-nsp/ for email.

#9 (**permalink**) 03-10-2008, 04:53 PM

Paul Pluzhnikov wrote:
> Larry Smith <[email protected]> writes:
>
>>> Have you checked the application with Valgrind (Linux), or Purify
>>> (Solaris and AIX)? No point in spending time guessing/debugging,
>>> unless you know that the application is "squeaky clean".
> ...
>> Valgrind & Purify - yes.
>
> Yes what?
>
> We've run the code under VG and Purify for large tests and they
> detected no issues? (That's very hard to believe).
>
> Or, yes we've run under VG and ignored all the bugs it found?
>

See my reply to Jose's msg for addt'l info.

VG shows no issues.

The plugins run on the client, and build the
Wintel binary data sent to the server.

The traps occur on the AIX server.

The Wintel binary data blocks sent
by the client can be up to 200K in size;
they are memory images of "packed" 'C' struct's
concat'd together and sent from the client to
the server. The Windows version of the server
uses these binary images 'as is'.

The Unix server picks apart this Wintel binary
data, placing it into matching struct's
(in the server's native alignment), correcting
endian-ness as it goes. Then the server uses
this binary data to send transaction request
to one or more Mainframe's. The response
data is used to modify the binary data, which
is then put back into the Wintel format
expected by the client and sent back to the
client.
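
A minimal sketch of that unpacking step, assuming an invented three-field record: the real struct layouts are not shown in this thread, so native_rec, the 7-byte wire layout, and the helper names are all hypothetical. Fields are read byte by byte, so the code never dereferences a misaligned pointer into the packed buffer and gives the same result on big-endian and little-endian hosts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical native record; the compiler chooses the padding. */
struct native_rec {
    uint8_t  kind;
    uint32_t account;
    uint16_t flags;
};

/* Assumed wire layout: kind (1 byte), account (4 bytes, little-endian),
 * flags (2 bytes, little-endian), with no padding between fields. */
enum { WIRE_REC_SIZE = 7 };
static_assert(WIRE_REC_SIZE == 1 + 4 + 2, "wire size must match the field list");

static uint16_t get_le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t get_le32(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Returns the number of bytes consumed, or 0 if the buffer is too short. */
static size_t unpack_rec(const unsigned char *buf, size_t len,
                         struct native_rec *out)
{
    if (len < WIRE_REC_SIZE)
        return 0;                 /* never read past the received block */
    out->kind    = buf[0];
    out->account = get_le32(buf + 1);
    out->flags   = get_le16(buf + 5);
    return WIRE_REC_SIZE;
}
```

The length check is the part to audit in the real code: any path that trusts a size or count field inside the block without checking it against the bytes actually received can run past the end of the buffer.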

>> No, they "restart" the thread.
>
> Too bad. Essentially what you've described is a sure-fire recipe for
> irreproducible crashes which are extremely hard to catch and debug.
>
> Do you at least log the fact that such a "restart" has occurred? (You
> should.) Do the crashes follow such "restarts"? (I expect so).
>

"restarts" are logged.

No, since we're working with the baseline test
data used to test all new releases of the app,
the trap/restart code is never invoked.
Testing with "bad" data will come later - after
the basic functional testing.

>> The threads "restart" because much of the
>> data is generated by customer-written
>> plugins & may have issues.
>
> This is a brain-dead design: if your customer-written plugin corrupts
> heap, and a later call to free() crashes (possibly in some other
> thread), while holding heap lock, and you longjmp out of free(),
> what hope do you have of making any further progress?
>
> *None whatsoever*.
>
>> They've used the setjmp/longjmp "restart" logic on
>> Windows for over 15 years...
>
> They've either got extremely lucky, or they didn't tell you the
> whole story.
>
> Also, on Windows customer-written DLLs may be statically linked
> against LIBC{MT}.LIB, in which case they will not share malloc()
> with the rest of the code.
>
> But on UNIX there is only one "global" malloc, so the issue of buggy
> "plugins" will be exacerbated.
>
>> This is a huge financial app (100's of DLL's,
>> 100+ exe's) written for Windows.
>
> So keep it on Windows, and don't touch it (for your sanity's sake).
>
>> Surprisingly, this all works well on Linux
>> and Solaris.
>
> Since you (apparently) have little (if any) AIX-specific code, I
> expect the same bug(s) that cause AIX crashes are also present in
> Solaris and Linux versions, and these versions don't "work well",
> you just haven't observed the problems yet.
>

Yes, I DO suspect the latent bug is everywhere,
but only exposes itself on AIX.

> Cheers,

Dozens of Windows developers & the System
Designer have worked on this app since
1989. We, the two new "Unix guys", do not
get to make design changes; we're charged
with making it work "as is" by writing
Windows emulations API's for Unix.

Thanks for your comments.

#10 (**permalink**) 03-10-2008, 04:53 PM

Larry Smith <[email protected]> writes:

>> Yes what? We've run the code under VG and Purify for large tests and
>> they detected no issues? (That's very hard to believe).
>> Or, yes we've run under VG and ignored all the bugs it found?
>>
>
> See my reply to Jose's msg for addt'l info.

There is no additional info in that reply (at least not anything
I can see) WRT Valgrind or Purify.

> VG shows no issues.

Ok. What about Purify on AIX?

> The plugins run on the client, and build the
> Wintel binary data sent to the server.

Ah, so the server side runs no customer code?

In that case the "restart" makes more sense -- the only time a
crash can legitimately be expected is while parsing the "blob"
that came from the client, and you can make it so you hold no locks
during that time.

So, going back to your crashes: what are some of the stack traces
where the crashes occurred?

Since you know the data is "good", and you know there is a bug
somewhere, the first logical step is to ask debugger where the
crash is, isn't it?

Disable the "restart" code, let the app crash under load test,
collect a core file each time it does, analyze all of them,
look for commonalities and other clues. If you can't find any clues,
post some of the stack traces here -- someone might be able to
help you.
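
A sketch of the first two steps, assuming they can be done once at process startup. The function names and the list of signals are illustrative; the app's own handler setup is not shown in this thread. It puts the fatal signals back to their default action, so a trap kills the process instead of "restarting" the thread, and raises the soft core-file size limit to the hard limit.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/resource.h>

/* Signals that normally produce a core file when left at SIG_DFL. */
static const int fatal_signals[] = { SIGSEGV, SIGBUS, SIGFPE, SIGILL, SIGABRT };

/* Restore default actions for the fatal signals; returns 0 on success. */
static int disable_restart_handlers(void)
{
    struct sigaction sa;
    size_t i;

    sa.sa_handler = SIG_DFL;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    for (i = 0; i < sizeof fatal_signals / sizeof fatal_signals[0]; i++) {
        assert(fatal_signals[i] > 0);
        if (sigaction(fatal_signals[i], &sa, NULL) != 0)
            return -1;
    }
    return 0;
}

/* Raise the soft core-size limit to the hard limit; returns 0 on success. */
static int enable_core_dumps(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;
    rl.rlim_cur = rl.rlim_max;
    return setrlimit(RLIMIT_CORE, &rl);
}
```

If the hard limit itself is 0, no core file will be written and the limit has to be raised from outside the process first.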

Cheers,
--
In order to understand recursion you must first understand recursion.
Remove /-nsp/ for email.


DBX on AIX 5.3 with Threads