SegFault running on AWS GPUs

Hey,

Not sure if this is a developer question or not but anyway:

I’ve been using GPUSPH on site with physical GPUs for a while now and am generally happy with it. We have some more work coming up and are looking to scale out onto AWS for burst computing, but we have hit some issues there that we’ve not seen locally.

The observations are:

  1. CentOS 7 base image with our standard install that works locally.
  2. G2 instances work fine.
  3. On G2, problems built with chrono=1 appear to be working so far.
  4. G3, P2 and P3 instances fail when using CHRONO.

When it fails with CHRONO, we get a non-specific segfault, for example for the CompleteSaExample problem:

   Chrono Body pointer: 0xd57dd8
   Mass: 4.000001e+00
   CG:   -2.000000e-01	-5.000000e-01	5.000000e-01
   I:    1.000000e+00	0.000000e+00	0.000000e+00
         0.000000e+00	1.000000e+00	0.000000e+00
         0.000000e+00	0.000000e+00	1.000000e+00
   Q:    1.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00
Chrono collision shape
   B. box:   X [0.069,0.331], Y [0.369,0.631], Z [-0.631,-0.369]
   size:     X [0.262] Y [0.262] Z [0.262]
[ip-10-235-88-234:11837] *** Process received signal ***
[ip-10-235-88-234:11837] Signal: Segmentation fault (11)
[ip-10-235-88-234:11837] Signal code: Address not mapped (1)
[ip-10-235-88-234:11837] Failing at address: 0x64ff8
[ip-10-235-88-234:11837] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x7ff90011d5d0]
[ip-10-235-88-234:11837] [ 1] /lib64/libc.so.6(+0x15acb6)[0x7ff8fe9d6cb6]
[ip-10-235-88-234:11837] [ 2] ./GPUSPH[0x463ac4]
[ip-10-235-88-234:11837] [ 3] ./GPUSPH[0x4640a5]
[ip-10-235-88-234:11837] [ 4] ./GPUSPH[0x462c6b]
[ip-10-235-88-234:11837] [ 5] ./GPUSPH[0x46d807]
[ip-10-235-88-234:11837] [ 6] ./GPUSPH[0x481e48]
[ip-10-235-88-234:11837] [ 7] ./GPUSPH[0x422da7]
[ip-10-235-88-234:11837] [ 8] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7ff8fe89e3d5]
[ip-10-235-88-234:11837] [ 9] ./GPUSPH[0x425e99]
[ip-10-235-88-234:11837] *** End of error message ***
Segmentation fault

Has anyone seen this behaviour before?

Thanks in advance,
-pete

Googling a bit more, this appears similar to:

But I can confirm that I’m using the same compilers:

  1. GCC 4.8.5 native on CentOS
  2. CUDA 10.1.105

Hrmm, still working through this. Got to be something simple I’m missing…

-pete

Hello @pbrady, and thanks for your interest in GPUSPH.

I can confirm that I’ve always seen this kind of problem in relation to a compilation mismatch between Chrono and GPUSPH. The mismatch may be a different compiler, or different flags (for example, one built as C++98 and the other as C++11). Some additional information may be useful for debugging:

  • what version or commit of Chrono are you using?
  • what version or commit of GPUSPH?
  • is the compiler the same across the different instances? More generally, are there other differences between the instances?

One thing that can provide more insight is to build GPUSPH with debug symbols, so that running it inside gdb will hopefully give more useful information via a backtrace. So, for example, create a file Makefile.local and add the following line to it:

CPPFLAGS=-g

Then do a make clean and build again, and run inside gdb.
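
For concreteness, the whole sequence would be something like this (the make options here are just an example, use whatever matches your case):

$ echo 'CPPFLAGS=-g' >> Makefile.local
$ make clean
$ make chrono=1 problem=CompleteSaExample
$ gdb --args ./GPUSPH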

Hi @giuseppe.bilotta,

Thanks for the quick reply. I’ll shoot this reply off now as I’m out of the office for a couple of hours, so I won’t get to the debugger right away. My builds are scripted to be as consistent as I can make them, but roughly:

  1. CentOS 7 native compiler: GCC 4.8.5 - this plays the nicest with CUDA and I stick with CentOS 7 for my other apps
  2. Chrono 3.0.0 via git - the latest 4.0.0 requires GCC >= 4.9.0. Built as per the install instructions, i.e. git → ccmake → make → make install, turning off modules that are not needed
  3. CUDA 10.1.105
  4. GPUSPH is running with a Makefile.local because I hit that other bug in Chrono about missing the AVX extensions headers.

My Makefile.local is:

CXXFLAGS+="-march=native"

Working back, I think there are two changes that I made to my build system, since testing late last night on a local machine that had not been patched for a while is now also failing:

  1. I’ve rolled up the CentOS patches from the last few weeks
  2. I’ve rolled up to CUDA 10.1.105

Incidentally, and I assume that you are aware of this already, but CUDA 10 has some deprecated functions that now throw warnings when compiling GPUSPH 4.1. The maths header is an easy change, but the safe_copy function could use a little more thought.

Thanks for the help so far. I’ll post results when I get back to the debugger.

Cheers,
-pete

@giuseppe.bilotta,

OK, thanks for the pointer on GDB - I think I’ve fixed this now. Still testing, but it is actually running.

Short answer: I had to recompile CHRONO with:

ENABLE_OPENMP = ON
USE_PARALLEL_CUDA = ON
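
For reference, the reconfigure amounts to something like this from the Chrono build directory (a sketch only; the exact option names may be spelled differently in other Chrono releases):

$ cmake -DENABLE_OPENMP=ON -DUSE_PARALLEL_CUDA=ON ..
$ make -j 8 && make install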

Not sure what the change from the previous builds was, but simulations with CHRONO are now running - at least for the first few time steps of testing. Will continue to test and see what happens.

The details are below, for your information.

Step 1:

make chrono=1 mpi=0 dbg=1 verbose=1 plain=1 echo=1 problem=CompleteSaExample

worked fine

Step 2:

make chrono=1 mpi=1 dbg=1 verbose=1 plain=1 echo=1 problem=CompleteSaExample -j 8

failed with gdb dumping:

Program received signal SIGSEGV, Segmentation fault.
0x000000000047f948 in __gnu_cxx::new_allocator<MovingBodyData*>::construct<MovingBodyData*<MovingBodyData* const&> > (this=0xe7eda8, __p=0x4c53f5a8)
at /usr/include/c++/4.8.2/ext/new_allocator.h:120
120 { ::new((void *)__p) _Up(std::forward<_Args>(__args)…); }

So I figured there was nothing to lose by recompiling CHRONO with OpenMP support, and voilà, here we are.

Having said all that, my solution does not make sense to me: what is the interaction between MPI in GPUSPH and OpenMP in CHRONO?

Will continue to test…

Thanks for the advice so far,
-pete

Hello @pbrady,

interesting, I too wonder what the interaction between OpenMP in Chrono and MPI in GPUSPH would be. Maybe rather than OpenMP specifically there’s some other compilation flag that gets automatically enabled/disabled because of that, and that’s the one conflicting. For example, does Chrono 3 build with C++11 by default on GCC 4.8?
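
One quick way to check is to look at the flags recorded in Chrono’s CMake cache, something like the following (the path is a placeholder for your Chrono build directory):

$ grep -E 'CMAKE_CXX_(FLAGS|STANDARD)' /path/to/chrono/build/CMakeCache.txt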

Can you provide the full backtrace from the segfault?

Also, my understanding is that you’re currently testing the 4.1 release or the master branch of GPUSPH; can you check whether the next branch has the same issues?

Hi,

Yup, I am on the current release 4.1. Happy to experiment with the next branch and see what happens. I’ll run those tests tomorrow morning when I get to the office and post the results for your information.

Cheers,
-pete

Hi,

Still working through the testing here. So far:

Rebuilt my local physical machines from scratch: problem resolved and they appear to be working as per the installation documentation. Not a great fix but if it works I’ll take it. Will continue to test.

Now focussing on EC2 instances, which still show issues. The above fix from a previous thread did not work.

GPUSPH next branch appears to need a more modern compiler than 4.8.5 on CentOS 7. I’ll try to spin up a Fedora 29 instance later and try that.

I’m unable to get a stack trace for you, which is very strange to me - mind you, my C++/CUDA is not that great. My environment:

  1. EC2 g3s.xlarge
  2. CHRONO 3.0.0 compiled as per installation documents
  3. GPUSPH 4.1
  4. Compile settings:

make chrono=1 mpi=1 dbg=1 verbose=1 plain=1 echo=1 problem=CompleteSaExample -j 8

Results:

  1. Running natively: problem dies
  2. Running under GDB: problem executes!

The stack trace from the terminal is:

Chrono collision shape
   B. box:   X [0.069,0.331], Y [0.369,0.631], Z [-0.631,-0.369]
   size:     X [0.262] Y [0.262] Z [0.262]
[ip-10-235-88-234:19085] *** Process received signal ***
[ip-10-235-88-234:19085] Signal: Segmentation fault (11)
[ip-10-235-88-234:19085] Signal code: Address not mapped (1)
[ip-10-235-88-234:19085] Failing at address: 0x64ff8
[ip-10-235-88-234:19085] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x7faa4b5535d0]
[ip-10-235-88-234:19085] [ 1] /lib64/libc.so.6(+0x15acb6)[0x7faa49e0ccb6]
[ip-10-235-88-234:19085] [ 2] ./GPUSPH_dbg[0x4807ad]
[ip-10-235-88-234:19085] [ 3] ./GPUSPH_dbg[0x48069e]
[ip-10-235-88-234:19085] [ 4] ./GPUSPH_dbg[0x480545]
[ip-10-235-88-234:19085] [ 5] ./GPUSPH_dbg[0x4803c7]
[ip-10-235-88-234:19085] [ 6] ./GPUSPH_dbg[0x48016b]
[ip-10-235-88-234:19085] [ 7] ./GPUSPH_dbg[0x47fb72]
[ip-10-235-88-234:19085] [ 8] ./GPUSPH_dbg[0x47f38a]
[ip-10-235-88-234:19085] [ 9] ./GPUSPH_dbg[0x47e86d]
[ip-10-235-88-234:19085] [10] ./GPUSPH_dbg[0x47d88e]
[ip-10-235-88-234:19085] [11] ./GPUSPH_dbg[0x47bebc]
[ip-10-235-88-234:19085] [12] ./GPUSPH_dbg[0x474960]
[ip-10-235-88-234:19085] [13] ./GPUSPH_dbg[0x490ecb]
[ip-10-235-88-234:19085] [14] ./GPUSPH_dbg[0x4a1531]
[ip-10-235-88-234:19085] [15] ./GPUSPH_dbg[0x488117]
[ip-10-235-88-234:19085] [16] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7faa49cd43d5]
[ip-10-235-88-234:19085] [17] ./GPUSPH_dbg[0x423699]
[ip-10-235-88-234:19085] *** End of error message ***
Segmentation fault
[centos@ip-10-235-88-234 gpusph]$ 

As per the previous symptoms, changing the compile line to:

make chrono=1 mpi=0 dbg=1 verbose=1 plain=1 echo=1 problem=CompleteSaExample -j 8

allows the problem to run.

Thanks for the help so far; it has to be something I’m missing. Will continue to test here.

Cheers,
-pete

Hello @pbrady,

GPUSPH next branch appears to need a more modern compiler than 4.8.5 on CentOS 7. I’ll try to spin up a Fedora 29 instance later and try that.

Interesting. I’m guessing this is due to the more aggressive usage of C++11 features that we rely on for next. We’ll see if we can lower the limit a bit.

By the way, there is generally no need to build with dbg=1, since that also builds the device code in debug mode, which is (1) extremely slow to build and (2) extremely slow to run. Adding -g to the CXXFLAGS and CUFLAGS (in Makefile.local) is usually sufficient.
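
In other words, a Makefile.local along these lines is usually enough:

CXXFLAGS+=-g
CUFLAGS+=-g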

To get a backtrace, a typical way is the following:

$ gdb --args ./GPUSPH
(gdb) run
[problem runs until it crashes]
(gdb) bt

Of course, if the problem doesn’t crash when running under the debugger, this is considerably more difficult, but given the stack trace that you do get, you can use addr2line to recover the source lines. From the stack trace you pasted, for example, one could use:

addr2line -e ./GPUSPH_dbg 0x4807ad 0x48069e 0x480545

to get the three topmost contexts in GPUSPH that led to the crash.
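
Another option, since the crash only shows up outside the debugger, is to enable core dumps and load the resulting core file into gdb afterwards (a sketch; where the core file ends up and what it is named depend on the system’s core_pattern setting):

$ ulimit -c unlimited
$ ./GPUSPH_dbg
$ gdb ./GPUSPH_dbg core
(gdb) bt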

The fact that the error is only reproducible with MPI enabled makes me suspect that this might be an issue with the C++ ABI used by MPI in the EC2 instances.
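
A quick sanity check, assuming the OpenMPI wrapper compilers, would be to compare what mpicxx actually invokes with the compiler Chrono was built with:

$ mpicxx --showme
$ mpicxx --version
$ g++ --version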

Hi,

Thanks for the suggestions - the addr2line is a new one for me.

I’m going to have to put this on hold for a couple of days as I need to get some project work out the door, but I’ll continue to investigate. For now, free-surface simulations are enough, but I’ll need the CHRONO free-body section of the code in a month or two when that particular project comes back, hence my preparation now.

I think what I’ll try next is twofold:

  1. Fedora has much more advanced compilers and is specifically supported by CUDA.
  2. With my OpenFOAM work on AWS I ended up rolling a self-compiled, newer version of OpenMPI than the one packaged on CentOS 7, so I’ll try the same here (see the sketch after this list).
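
For the record, by a self-compiled OpenMPI I mean the usual prefix install, roughly as follows (the prefix here is only a placeholder):

$ ./configure --prefix=$HOME/opt/openmpi
$ make -j 8 && make install
$ export PATH=$HOME/opt/openmpi/bin:$PATH
$ export LD_LIBRARY_PATH=$HOME/opt/openmpi/lib:$LD_LIBRARY_PATH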

Thanks for the help so far and I’ll continue to post my results here.

Cheers,
-pete

Excellent. Do remember to rebuild with dbg=0 for production use :sunglasses:

Cheers,

Giuseppe