We have a large app that's built on Linux, Solaris, and AIX. The app uses many threads (pthreads API).

There is a bug in the app that shows itself only on AIX. Everything seems to work OK on Linux and Solaris. On AIX some threads trap at random (i.e. there is no recognizable pattern to the traps).

The app has a signal handler that catches all signals and restarts the offending thread; however, the data for the transaction that was being processed when the trap occurred is lost.

We disabled the program's signal handler and ran it under DBX (dbx -r). The program hangs without doing any work.

Thread 0 starts many child threads in a loop. It starts a child thread, then calls pthread_cond_timedwait() to wait for notification from the thread that it has finished its initial setup (ODBC database connections, etc.) and is ready to accept work.

The child thread calls pthread_cond_signal() to notify thread 0 when its initial setup is complete. At that time thread 0 loops and starts the next child thread, again calling pthread_cond_timedwait() to wait for the child thread's "I'm started" notification.

Under DBX, thread 0 appears to hang on the pthread_cond_signal() call, i.e. the pthread_cond_signal() never returns.

Any ideas?
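The startup handshake described in this post can be sketched as follows. This is only a minimal sketch, not the application's code: `struct startup`, `child_main()` and `start_child_and_wait()` are hypothetical names, and the ODBC setup is elided. It assumes the "I'm started" state is kept in a flag protected by the mutex, so a signal sent before the parent reaches pthread_cond_timedwait() is not lost and a spurious wakeup is not mistaken for a notification.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical per-child startup record; not taken from the original app. */
struct startup {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            ready;   /* predicate: set by the child when setup is done */
};

/* Child side: run the setup, then publish "ready" while holding the mutex. */
static void *child_main(void *arg)
{
    struct startup *s = arg;

    /* ... ODBC connections and other setup would go here ... */

    pthread_mutex_lock(&s->lock);
    s->ready = true;
    pthread_cond_signal(&s->cond);
    pthread_mutex_unlock(&s->lock);

    /* ... a real worker would now enter its processing loop ... */
    return NULL;
}

/* Parent side: start one child and wait up to timeout_sec for it.
 * Returns 0 on success, ETIMEDOUT on timeout, or another error number. */
static int start_child_and_wait(pthread_t *tid, struct startup *s, int timeout_sec)
{
    struct timespec deadline;
    int rc;

    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->cond, NULL);
    s->ready = false;

    rc = pthread_create(tid, NULL, child_main, s);
    if (rc != 0)
        return rc;

    /* pthread_cond_timedwait() takes an absolute deadline, not an interval. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_sec;

    pthread_mutex_lock(&s->lock);
    while (!s->ready && rc == 0)
        rc = pthread_cond_timedwait(&s->cond, &s->lock, &deadline);
    /* The predicate decides the result, so a wakeup that raced
     * with the timeout still counts as success. */
    rc = s->ready ? 0 : rc;
    pthread_mutex_unlock(&s->lock);
    return rc;
}
```

Thread 0 would call `start_child_and_wait()` once per child in its loop. If the real code waits on the condition variable without such a flag, a child that finishes its setup quickly can signal before the parent is waiting, and the parent then sits until the timeout.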
| |||
On Mar 6, 7:21 am, Larry Smith <[email protected]> wrote:
> [...]
> Under DBX, thread 0 appears to hang on
> the pthread_cond_signal() call - i.e.
> the pthread_cond_signal() never returns.
>
> Any ideas???

No idea if this is your problem, but we found that the alignment of pthread structures is extremely important. We were compiling programs on AIX (5.1 and 5.3) with the "-qalign=packed" option to IBM's Visual Age 'C' compiler. These 'C' programs include pthread structures (such as pthread_mutex_t, pthread_attr_t, pthread_t, and pthread_cond_t) within other structures. We experienced similar symptoms: some programs would appear to "lose" signals and sometimes have what appeared to be hanging threads.

The problem was solved by moving the pthread structures out of the other structures, by moving them to be the first members of the other structures, or by surrounding the structure containing the pthread structures with a pragma (#pragma options align=natural).

Of course, your mileage may vary.

-tony
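The alignment problem described here can be checked at build or test time. The sketch below is an illustration under stated assumptions: it uses `#pragma pack`, which GCC and Clang accept, as a stand-in for the "-qalign=packed" option mentioned in the post, and `struct packed_rec`, `struct natural_rec` and `offset_is_aligned()` are made-up names. It compares the offset of an embedded pthread_mutex_t against the alignment the type requires.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical record with byte packing, the kind of layout a
 * "packed" compile option or pragma can produce. */
#pragma pack(push, 1)
struct packed_rec {
    char            tag;
    pthread_mutex_t lock;    /* lands at offset 1: misaligned */
};
#pragma pack(pop)

/* Same members with the compiler's default (natural) alignment. */
struct natural_rec {
    char            tag;
    pthread_mutex_t lock;    /* compiler inserts padding before this member */
};

/* Returns 1 if a member at this offset meets pthread_mutex_t's alignment,
 * assuming the enclosing structure itself is suitably aligned. */
static int offset_is_aligned(size_t offset)
{
    return offset % _Alignof(pthread_mutex_t) == 0;
}
```

A check such as `offset_is_aligned(offsetof(struct packed_rec, lock))` can be placed in a unit test or a static assertion for every structure that embeds a pthread object, which turns a silent layout problem into a build-time failure.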
| |||
[email protected] wrote:
> [...]
> The problem was solved by moving the pthread structures "out" of other
> structures, or by moving them to be the first members of the other
> structures....or by surrounding the structure containing pthread
> structures with a PRAGMA (#pragma options align=natural)
>
> Of course, your mileage may vary.
>
> -tony

Thanks for the info...

We do not use -qalign, so we're getting the platform defaults.

DBX has been no help. The app normally comes up and starts its 30+ threads in a few seconds. Under DBX it took six minutes to get just the first two threads up. So, if the issue is related to thread contention, we'll never see it under DBX because everything has slowed down so much.

We do not have this kind of performance penalty running under the Linux 'kdbg' debugger. However, the app never traps on Linux or Solaris, only on AIX.

"electric fence" does not show any buffer overruns.

We're stumped on the AIX issues.
| |||
Larry Smith <[email protected]> writes:

> We have a large app that's built on
> Linux, Solaris, & AIX. The app uses
> many threads (pthreads API).

Have you checked the application with Valgrind (Linux), or Purify (Solaris and AIX)? There is no point in spending time guessing/debugging unless you know that the application is "squeaky clean".

> There is a bug in the app that shows itself
> only on AIX. All (seems to) work OK on
> Linux & Solaris. On AIX some threads trap
> at random (i.e. there is no recognizable
> pattern to the traps).
>
> The app has a signal handler that catches
> all signals and restarts the offending
> thread;

You can't restart a thread. If one thread of the application gets a fatal signal, there is absolutely no reasonable way to recover from that. I assume you mean you restart the whole process.

> We disabled the program's signal handler
> and ran it under DBX (dbx -r). The program
> hangs without doing any work.

If dbx interferes so much with your application, you should be able to let it core-dump and debug it post-mortem. What are some of the crash stack traces?

Cheers,
--
In order to understand recursion you must first understand recursion.
Remove /-nsp/ for email.
| |||
Paul Pluzhnikov wrote:
> Have you checked the application with Valgrind (Linux), or Purify
> (Solaris and AIX)? No point in spending time guessing/debugging,
> unless you know that the application is "squeaky clean".
> [...]
> You can't restart a thread. If one thread of the application gets
> a fatal signal, there is absolutely no reasonable way to recover
> from that. I assume you mean you restart the whole process.
> [...]

No, they "restart" the thread. Each thread does a setjmp(), registers the jmpbuf with a global handler, then goes into a processing loop. When a trap occurs (memory access, math error, etc.) the global handler looks up the appropriate jmpbuf for the thread, then uses it to do a longjmp, "restarting" the thread, which repeats the setjmp() and re-enters its processing loop.

The threads "restart" because much of the data is generated by customer-written plugins and may have issues. They've used the setjmp/longjmp "restart" logic on Windows for over 15 years...

This is a huge financial app (100's of DLLs, 100+ exes) written for Windows. It's been ported to 64-bit Linux, Solaris, and AIX by writing Unix versions of the Windows APIs used by the app (e.g. xxThreadStart() wraps pthread_create() on Unix, but wraps _beginthreadex() on Windows; xxMutexCreate() wraps CreateMutex() on Windows, but wraps pthread_mutex_init() on Unix; etc.). There are hundreds of these "wrapper" APIs.

Valgrind & Purify - yes.

The server side (where the issues are) receives blocks of BINARY data created by Windows clients (8K up to 400K). These blocks contain the memory images of lists of heterogeneous "packed" 'C' structs created on the Windows clients. The Windows version of the server code uses these binary blocks "as is". The Unix version of the server code has to pick this data apart and rebuild the same 'C' structs in the server's native layout. The server processes these "transaction requests" by communicating with a mainframe and other servers (which may be running on Windows), updating the binary data, then sending the binary data back to the Windows client in the original Windows binary format.

Why go to all of this trouble? Because "rule number one" of this "porting project" is that we can not require code/data changes on the Windows client side. Doing so would break tens of thousands of existing client installations. Rule number two is that, except for the "wrapper API", the server code is single-source (i.e. all platforms are built from the same set of source files).

The Unix versions of the server are required to be direct "drop in" replacements for the Windows version of the server, requiring NO changes to either the Windows clients or the other servers (mainframe and Windows) with which it interacts.

Surprisingly, this all works well on Linux and Solaris. On AIX we experience random traps under high load (300+ simultaneous "requests" of 60k+ each).
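For readers unfamiliar with the mechanism, the setjmp/longjmp "restart" described in this post might look roughly like the sketch below. It is an illustration only, not the application's code: `restart_point`, `restart_armed`, `trap_handler()`, `install_trap_handler()` and `worker()` are hypothetical names, thread-local storage stands in for the jmpbuf registry, and raise() stands in for a real fault. Jumping out of a signal handler is only well defined in limited circumstances, which is the point other posters in this thread make about the design.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>

/* One restart point per thread (hypothetical names throughout). */
static _Thread_local sigjmp_buf restart_point;
static _Thread_local volatile sig_atomic_t restart_armed;

/* Process-wide handler: jump back to the faulting thread's restart point. */
static void trap_handler(int sig)
{
    if (restart_armed)
        siglongjmp(restart_point, 1);

    /* No restart point registered: fall back to the default action. */
    signal(sig, SIG_DFL);
    raise(sig);
}

/* Install the handler for two representative trap signals. */
static int install_trap_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = trap_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return -1;
    if (sigaction(SIGFPE, &sa, NULL) != 0)
        return -1;
    return 0;
}

/* Worker: the first pass simulates a trap; after the "restart"
 * the body runs to completion. arg points to a restart counter. */
static void *worker(void *arg)
{
    volatile int *restarts = arg;

    /* savemask = 1, so the signal mask is restored on siglongjmp. */
    if (sigsetjmp(restart_point, 1) != 0)
        (*restarts)++;              /* arrived here via siglongjmp */
    restart_armed = 1;

    if (*restarts == 0)
        raise(SIGSEGV);             /* stand-in for a real fault */

    restart_armed = 0;
    return NULL;
}
```

Nothing in this sketch releases locks or repairs data structures that the interrupted code was in the middle of updating; whatever state the thread held at the moment of the trap is simply abandoned.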
| |||
Larry Smith wrote:
> [...]
> The threads "restart" because much of the
> data is generated by customer-written
> plugins & may have issues. They've used
> the setjmp/longjmp "restart" logic on
> Windows for over 15 years...

Larry, what you've just described is broken beyond belief.

Restarting a thread that just died due to bad data or incorrect parsing of valid data doesn't fix the issue and is a stack-smasher's wet dream.

A thread with invalid data structures may corrupt all the other threads' data.

> This is a huge financial app (100's of DLL's,
> 100+ exe's) written for Windows. It's been
> ported to 64 bit Linux, Solaris, & AIX by writing
> Unix versions of the Windows API's used by
> the app (e.g. xxThreadStart() wraps
> pthread_create() on Unix, but wraps
> _beginthreadex() on Windows; xxMutexCreate()
> wraps CreateMutex() on Windows, but wraps
> pthread_mutex_init() on Unix; etc, etc).
> There are hundreds of these "wrapper" API's.
>
> [...]
>
> Why go to all of this trouble?
> Because "rule number one" of this
> "porting project" is that we can not
> require code/data changes on the Windows
> Client side.

Clearly that version of the Windows client was the only one that compiled and was able to run for more than 10 seconds. The PHBs will be too afraid to touch it.

> Doing so would break
> tens of thousands of existing client
> installations. Rule number two is that,
> except for the "wrapper API", the Server
> code is single-source (i.e. all platforms
> are built from the same set of source files).

Even Linux, which is single-source, has a few files that are architecture-specific.

It will all depend on how thick and well-written the glue layer is.

> The Unix version's of the server are required
> to be direct "drop in" replacements for the
> Windows version of the server - requiring NO
> changes to either the Windows clients or the
> other servers (mainframe & Windows) with
> which it interacts.

That makes sense (unlike the thread restart).

> Surprisingly, this all works well on Linux
> and Solaris.

Timing, alignment, SMP vs. UP, and the code paths exercised make all the difference.

> On AIX we experience random
> traps under high load (300+ simultaneous
> "requests" of 60k+ each).

"Random" traps are generally the result of memory structures being corrupted. Usually this is due to a critical section not being correctly protected.
| |||
Jose Pina Coelho wrote:
> Larry Smith wrote:
>> [...]
>> The threads "restart" because much of the
>> data is generated by customer-written
>> plugins & may have issues. They've used
>> the setjmp/longjmp "restart" logic on
>> Windows for over 15 years...
>
> Larry, what you've just described is broken beyond belief.
>
> Restarting a thread that just died due to bad data or incorrect parsing
> of valid data doesn't fix the issue and is a stack-smasher's wet dream.
>
> A thread with invalid data structures may corrupt all the other threads'
> data.

The "porting team" has no control over this design. It has been in place (in 100's of places in the source code) since at least 1995.

>> [...]
>> Why go to all of this trouble?
>> Because "rule number one" of this
>> "porting project" is that we can not
>> require code/data changes on the Windows
>> Client side.
>
> Clearly that version of the Windows client was the only one that
> compiled and was able to run for more than 10 seconds.

The Windows client and server (in numerous versions) have been in production at thousands of sites since the mid-1990s. The entire system architecture was built around the assumption that the "restart" approach is fine, and on Windows it has worked for a long time. To change the approach would require a complete re-design/re-write of the entire system. That was not approved...

[...]

>> Doing so would break
>> tens of thousands of existing client
>> installations. Rule number two is that,
>> except for the "wrapper API", the Server
>> code is single-source (i.e. all platforms
>> are built from the same set of source files).
>
> Even Linux, which is single-source, has a few files that are
> architecture-specific.
>
> It will all depend on how thick and well-written the glue layer is.

For this app, it's working quite well. All of the OS-specific stuff is hidden inside generic APIs (e.g. xxMutexCreate()). The sources for all libs and executables call only these APIs.

>> The Unix version's of the server are required
>> to be direct "drop in" replacements for the
>> Windows version of the server - requiring NO
>> changes to either the Windows clients or the
>> other servers (mainframe & Windows) with
>> which it interacts.
>
> That makes sense (unlike the thread restart).
>
>> Surprisingly, this all works well on Linux
>> and Solaris.
>
> Timing, alignment, SMP vs. UP, and exercised paths make all the difference.
>
>> On AIX we experience random
>> traps under high load (300+ simultaneous
>> "requests" of 60k+ each).
>
> "Random" traps are generally the result of memory structures being
> corrupted. Usually this is due to a critical section not being correctly
> protected.

Yes, I understand that - I just can't find it...
| |||
Larry Smith <[email protected]> writes:

>> Have you checked the application with Valgrind (Linux), or Purify
>> (Solaris and AIX)? No point in spending time guessing/debugging,
>> unless you know that the application is "squeaky clean".
> ....
> Valgrind & Purify - yes.

Yes what?

We've run the code under VG and Purify for large tests and they detected no issues? (That's very hard to believe.)

Or, yes, we've run under VG and ignored all the bugs it found?

> No, they "restart" the thread.

Too bad. Essentially what you've described is a sure-fire recipe for irreproducible crashes which are extremely hard to catch and debug.

Do you at least log the fact that such a "restart" has occurred? (You should.) Do the crashes follow such "restarts"? (I expect so.)

> The threads "restart" because much of the
> data is generated by customer-written
> plugins & may have issues.

This is a brain-dead design: if your customer-written plugin corrupts the heap, and a later call to free() crashes (possibly in some other thread) while holding the heap lock, and you longjmp out of free(), what hope do you have of making any further progress?

*None whatsoever*.

> They've used the setjmp/longjmp "restart" logic on
> Windows for over 15 years...

They've either got extremely lucky, or they didn't tell you the whole story.

Also, on Windows customer-written DLLs may be statically linked against LIBC{MT}.LIB, in which case they will not share malloc() with the rest of the code.

But on UNIX there is only one "global" malloc, so the issue of buggy "plugins" will be exacerbated.

> This is a huge financial app (100's of DLL's,
> 100+ exe's) written for Windows.

So keep it on Windows, and don't touch it (for your sanity's sake).

> Surprisingly, this all works well on Linux
> and Solaris.

Since you (apparently) have little (if any) AIX-specific code, I expect the same bug(s) that cause the AIX crashes are also present in the Solaris and Linux versions, and these versions don't "work well"; you just haven't observed the problems yet.

Cheers,
| |||
Paul Pluzhnikov wrote:
> Larry Smith <[email protected]> writes:
>
>> Valgrind & Purify - yes.
>
> Yes what?
>
> We've run the code under VG and Purify for large tests and they
> detected no issues? (That's very hard to believe.)
>
> Or, yes, we've run under VG and ignored all the bugs it found?

See my reply to Jose's msg for additional info.

VG shows no issues.

The plugins run on the client, and build the Wintel binary data sent to the server. The traps occur on the AIX server. The Wintel binary data blocks sent by the client can be up to 200K in size; they are memory images of "packed" 'C' structs concatenated together and sent from the client to the server. The Windows version of the server uses these binary images 'as is'. The Unix server picks apart this Wintel binary data, placing it into matching structs (in the server's native alignment), correcting endianness as it goes. Then the server uses this binary data to send transaction requests to one or more mainframes. The response data is used to modify the binary data, which is then put back into the Wintel format expected by the client and sent back to the client.

>> No, they "restart" the thread.
>
> Too bad. Essentially what you've described is a sure-fire recipe for
> irreproducible crashes which are extremely hard to catch and debug.
>
> Do you at least log the fact that such a "restart" has occurred? (You
> should.) Do the crashes follow such "restarts"? (I expect so.)

"Restarts" are logged. No, since we're working with the baseline test data used to test all new releases of the app, the trap/restart code is never invoked. Testing with "bad" data will come later, after the basic functional testing.

> [...]
> Since you (apparently) have little (if any) AIX-specific code, I
> expect the same bug(s) that cause AIX crashes are also present in
> Solaris and Linux versions, and these versions don't "work well",
> you just haven't observed the problems yet.

Yes, I DO suspect the latent bug is everywhere, but only exposes itself on AIX.

Dozens of Windows developers and the System Designer have worked on this app since 1989. We, the two new "Unix guys", do not get to make design changes; we're charged with making it work "as is" by writing Windows emulation APIs for Unix.

Thanks for your comments.
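The "pick apart the Wintel data" step described in this post is the kind of code where alignment and byte-order mistakes show up only on some platforms. The sketch below shows one portable way to do it, assuming a made-up 7-byte packed little-endian record; `struct native_rec`, `unpack_rec()` and `pack_rec()` are hypothetical and not taken from the application. It reads and writes integers one byte at a time, so it never dereferences a misaligned pointer and behaves the same on little-endian and big-endian hosts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wire layout (packed, little-endian), 7 bytes:
 *   offset 0: uint8_t  kind
 *   offset 1: uint16_t flags
 *   offset 3: uint32_t amount
 */
enum { WIRE_REC_SIZE = 7 };

/* Native, naturally aligned equivalent of the wire record. */
struct native_rec {
    uint8_t  kind;
    uint16_t flags;
    uint32_t amount;
};

/* Assemble little-endian integers byte by byte: independent of the
 * host byte order and of the alignment of p. */
static uint16_t get_le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t get_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static void put_le16(unsigned char *p, uint16_t v)
{
    p[0] = (unsigned char)(v & 0xff);
    p[1] = (unsigned char)(v >> 8);
}

static void put_le32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v & 0xff);
    p[1] = (unsigned char)((v >> 8) & 0xff);
    p[2] = (unsigned char)((v >> 16) & 0xff);
    p[3] = (unsigned char)(v >> 24);
}

/* Wire -> native. Returns 0 on success, -1 if the buffer is too short. */
static int unpack_rec(const unsigned char *buf, size_t len, struct native_rec *out)
{
    if (len < WIRE_REC_SIZE)
        return -1;
    out->kind   = buf[0];
    out->flags  = get_le16(buf + 1);
    out->amount = get_le32(buf + 3);
    return 0;
}

/* Native -> wire. Returns 0 on success, -1 if the buffer is too short. */
static int pack_rec(const struct native_rec *in, unsigned char *buf, size_t len)
{
    if (len < WIRE_REC_SIZE)
        return -1;
    buf[0] = in->kind;
    put_le16(buf + 1, in->flags);
    put_le32(buf + 3, in->amount);
    return 0;
}
```

The explicit length checks matter as much as the byte handling: a conversion routine that trusts a length or count field inside the block can read or write past the end of a buffer, and whether that traps depends on the platform's memory layout.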
| ||||
Larry Smith <[email protected]> writes:

>> Yes what? We've run the code under VG and Purify for large tests and
>> they detected no issues? (That's very hard to believe).
>> Or, yes we've run under VG and ignored all the bugs it found?
>
> See my reply to Jose's msg for addt'l info.

There is no additional info in that reply (at least not anything I can see) WRT Valgrind or Purify.

> VG shows no issues.

OK. What about Purify on AIX?

> The plugins run on the client, and build the
> Wintel binary data sent to the server.

Ah, so the server side runs no customer code?

In that case the "restart" makes more sense: the only time a crash can legitimately be expected is while parsing the "blob" that came from the client, and you can make it so you hold no locks during that time.

So, going back to your crashes: what are some of the stack traces where the crashes occurred?

Since you know the data is "good", and you know there is a bug somewhere, the first logical step is to ask the debugger where the crash is, isn't it?

Disable the "restart" code, let the app crash under load test, collect a core file each time it does, analyze all of them, and look for commonalities and other clues.

If you can't find any clues, post some of the stack traces here; someone might be able to help you.

Cheers,
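One possible way to follow the "disable the restart code and collect core files" advice from inside the process is sketched below. It assumes a POSIX system; `enable_core_dumps()` and `restore_default_traps()` are hypothetical helper names, and the list of signals is only a guess at what the application's handler catches. Whether and where a core file is actually written still depends on system configuration outside the program.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/resource.h>

/* Raise the soft core-file size limit to the hard limit.
 * Returns 0 on success, -1 on error. */
static int enable_core_dumps(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;
    rl.rlim_cur = rl.rlim_max;
    return setrlimit(RLIMIT_CORE, &rl);
}

/* Put common trap signals back to their default action, so a fault
 * terminates the process (and can dump core) instead of entering an
 * application handler. Returns 0 on success, -1 on error. */
static int restore_default_traps(void)
{
    static const int sigs[] = { SIGSEGV, SIGBUS, SIGFPE, SIGILL, SIGABRT };
    struct sigaction sa;
    size_t i;

    sa.sa_handler = SIG_DFL;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);

    for (i = 0; i < sizeof sigs / sizeof sigs[0]; i++) {
        if (sigaction(sigs[i], &sa, NULL) != 0)
            return -1;
    }
    return 0;
}
```

Calling both helpers early in main(), for example under a debug switch, leaves the faulting thread's stack intact in the core file, which can then be examined post-mortem instead of running the whole load test under the debugger.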