Building case with Chrono and type mismatch with instruction set architecture

sph_tudelft_nl · September 13, 2019, 5:03pm

Situation
nvcc --version: 10.1, V10.1.168
gcc --version: Debian 6.3.0-18+deb9u1 6.3.0 20170516
Chrono version: 4.0.0

Action 1
Building and running problem DamBreak3D is fine

Action 2
Instead
make problem=CompleteSaExample chrono=1
leads to

...
[CC] ProblemCore.o
[CC] pugixml.o
[CC] vector_print.o
/usr/lib/gcc/x86_64-linux-gnu/6/include/avx512fintrin.h(9218): error: argument of type "const void *" is incompatible with parameter of type "const float *"
[...repeat x44 with types varying...]
[CC] Synchronizer.o
/usr/lib/gcc/x86_64-linux-gnu/6/include/avx512vlintrin.h(10979): error: argument of type "const void *" is incompatible with parameter of type "const long long *"
/usr/lib/gcc/x86_64-linux-gnu/6/include/avx512vlintrin.h(10989): error: argument of type "void *" is incompatible with parameter of type "float *"
[...repeat x48 with types varying...]
[CC] VTUReader.o
[CC] main.o
[CC] Options.o
[CC] base64.o
[CC] GPUSPH.o
[CC] Integrator.o
92 errors detected in the compilation of "/tmp/tmpxft_00000621_00000000-6_CompleteSaExample.cpp1.ii".
Makefile:1000: recipe for target 'build/problems/CompleteSaExample.o' failed
make: *** [build/problems/CompleteSaExample.o] Error 1
make: *** Waiting for unfinished jobs....

The header files avx512fintrin, avx512pfintrin and avx512vlintrin belong to the AVX-512 instruction set architecture. The CPU is an Intel® Xeon® Silver 4112 @ 2.60GHz: full specifications here.

Question
Any idea why turning Chrono on causes this conflict (assuming this is the reason) and how to work around it? Thanks in advance for digging into this. More info available upon request.

Shallow research (added)

A quick look suggests that the problem lies with three instruction subsets of AVX-512: F for foundation, PF for prefetch instructions, and VL for vector length extensions. Parsing the header file names gives avx+512+f+intrin, avx+512+pf+intrin, and avx+512+vl+intrin.
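As a small illustration of the name parsing above, the subset tag can be pulled out of a header name with a regular expression. This is only a sketch: the helper name avx512_subset is made up for this post and is not part of GCC or GPUSPH.

```python
import re

# GCC names its AVX-512 intrinsic headers avx512<subset>intrin.h
AVX512_HEADER = re.compile(r"avx512([a-z0-9]+?)intrin(?:\.h)?$")

def avx512_subset(header):
    """Return the AVX-512 subset tag in upper case, or None if not an AVX-512 header."""
    name = header.rsplit("/", 1)[-1]  # drop any directory part
    match = AVX512_HEADER.match(name)
    return match.group(1).upper() if match else None

for h in ("avx512fintrin.h", "avx512pfintrin.h", "avx512vlintrin.h", "avx2intrin.h"):
    print(h, "->", avx512_subset(h))
```

Headers that do not follow the avx512 naming pattern, such as avx2intrin.h, simply yield None.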

The manual of gcc 6.3.0 promises support for the whole instruction set by adding the compilation flag -march=skylake-avx512 — see manual instructions (page 354). For fine-grained control one might even want to enable/disable the switches for single instruction subsets following the manual instructions (page 363) again.

There is much discussion in Q&A sites on this error occurring when compiling other programs with earlier versions of gcc (5x). Identifying a single promising avenue is not easy, as is often the case with discussion boards.

I also realize that a similar issue had already been reported in this forum at Compilation of next with Chrono using gcc 6.3.0. As in that case, setting the compilation optimization level to 0 solves the issue, and any other level from 1 through 3 brings it back; most probably because AVX is an advanced vector extension that serves optimization purposes.

Hands-on research (added)

1. I recompiled Chrono alone, passing -march=skylake-avx512 as the value of CMAKE_CXX_FLAGS and CMAKE_C_FLAGS. Running GPUSPH leads to a similar failure.
2. On top of 1, I also recompiled GPUSPH, passing -march=skylake-avx512 as the value of CXXFLAGS. Similar failure.
3. Doing the opposite of 1 and 2, I recompiled both Chrono and GPUSPH with the offending instruction sets disabled, passing -mno-avx512f -mno-avx512pf -mno-avx512vl to the appropriate compilation variables. Similar failure.

giuseppe.bilotta · September 17, 2019, 8:20pm

Hello @sph_tudelft_nl,

the problem is not so much GCC's support for the latest AVX instructions as what the NVIDIA compiler knows about that support, which is a downside of the single-source approach used by CUDA.

I am unaware of a good solution to this issue, which, judging by the internet, affects or has affected a lot of major CUDA projects. Possible workarounds in the GPUSPH+Chrono case are:

building Chrono and GPUSPH without AVX512 support (i.e. specifying a -march= that does NOT support AVX512);
building Chrono and GPUSPH with an older version of gcc that does not support AVX512 (e.g. gcc 4.9);
building Chrono and GPUSPH with a more recent version of gcc (8.3 from Debian buster seems to work correctly).

sph_tudelft_nl · September 18, 2019, 11:49am

Thanks @giuseppe.bilotta for the feedback. Returning the favour with hands-on experience:

Staying with 6.3.0: fail

I have set -march=skylake in both Chrono and GPUSPH. According to the manual, this switch does not enable the AVX512 instruction set, only AVX and AVX2, which are good to have.
Sadly/strangely, same outcome: include files with pattern avx512* are invoked nonetheless, and the compilation fails.

From GPUSPH's own make show:

CXXFLAGS: -march=skylake -m64 -std=c++11 -O3
CUFLAGS: -arch=sm_60 --generate-line-info -std=c++11 --compiler-options -march=skylake,-m64,-O3

Might it be that all executables and libraries invoked by Chrono+GPUSPH (MPI, HDF5, …) have to be recompiled for the same AVX512-free architecture? What is actually calling those includes?
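To double-check that the host -march really reaches nvcc, one could split the CUFLAGS line from make show and look at what follows --compiler-options. A rough Python sketch, assuming the host flags are passed as a single comma-separated argument as in the line above (host_flags and march_values are hypothetical helpers, not GPUSPH tools):

```python
import shlex

def host_flags(cuflags):
    """Collect the flags that nvcc forwards to the host compiler."""
    tokens = shlex.split(cuflags)
    forwarded = []
    for i, tok in enumerate(tokens[:-1]):
        # nvcc accepts --compiler-options (or -Xcompiler) followed by a comma-separated list
        if tok in ("--compiler-options", "-Xcompiler"):
            forwarded.extend(tokens[i + 1].split(","))
    return forwarded

def march_values(flags):
    """Return the values of every -march= flag in the list."""
    return [f.split("=", 1)[1] for f in flags if f.startswith("-march=")]

cuflags = "-arch=sm_60 --generate-line-info -std=c++11 --compiler-options -march=skylake,-m64,-O3"
print(march_values(host_flags(cuflags)))  # ['skylake']
```

Note that -arch=sm_60 is the GPU architecture and is deliberately not picked up by the -march= check.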

Falling back to 4.x: temporarily ruled out

Not really desirable: if another compiler version is needed anyway, it is better to move onwards and upwards (optimistic ansatz).

Upgrading to >6.x: pending

I will try this out as soon as possible and edit this post accordingly. The post remains open.

Please feel free to double check, for making silly mistakes is always possible, even when implementing brilliant ideas.

giuseppe.bilotta · September 18, 2019, 1:08pm

I would expect that just rebuilding Chrono and GPUSPH would suffice, unless nvcc is overriding the appropriate defines by itself.

To see where the inclusion is coming from, it should be possible to do something using the -H flag for the compiler, or -E -dI and then look at the preprocessed source. The idea would be to do make -j1 plain=1 echo=1 which would show the full compile line, and then running the same command-line manually after adding -E -dI or -H flags. This should show the include chain.
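Once the -H output is captured, the chain can be reconstructed from the leading dots, since each dot marks one level of include nesting. A minimal Python sketch of that idea (include_chain is a hypothetical helper name, and it assumes the trace lines have the usual GCC -H shape: dots, a space, then a path):

```python
def include_chain(h_output, target):
    """Return the headers leading to the first inclusion of `target`, outermost first."""
    stack = []  # stack[i] is the header currently open at nesting depth i + 1
    for line in h_output.splitlines():
        dots, _, path = line.strip().partition(" ")
        if not path or set(dots) != {"."}:
            continue  # not an include-trace line (e.g. a compiler diagnostic)
        depth = len(dots)
        del stack[depth - 1:]  # close headers at this depth or deeper
        stack.append(path)
        if path.endswith(target):
            return list(stack)
    return []  # target never included

trace = """\
. src/problems/Example.h
.. include/chrono/core/ChMatrix.h
... /usr/lib/gcc/x86_64-linux-gnu/6/include/immintrin.h
.... /usr/lib/gcc/x86_64-linux-gnu/6/include/avx512fintrin.h
"""
for header in include_chain(trace, "avx512fintrin.h"):
    print(header)
```

The paths in the sample trace are shortened placeholders; the real input would be the compiler output redirected to a file.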

sph_tudelft_nl · September 18, 2019, 2:24pm

a) CXXFLAGS=-march=skylake -E -dI

I focused on the command make -j1 plain=1 echo=1 with CXXFLAGS=-march=skylake -E -dI. For completeness, man gcc says that -dI will “output #include directives in addition to the result of preprocessing” and that -E means “stop after the preprocessing stage”.

The interesting outcome is:

/opt64/cuda_10.1.168_418.67/bin/…/targets/x86_64-linux/include/cuda_runtime.h(83): catastrophic error: cannot open source file “crt/host_config.h”

In fact this file does exist:

/opt64/cuda_10.1.168_418.67/targets/x86_64-linux/include/crt/host_config.h: C source, ASCII text

and has vanilla permissions.
Adding

CUDA_INCLUDE_PATH+=-I$(CUDA_INSTALL_PATH)/targets/x86_64-linux/include
INCPATH+=-I$(CUDA_INSTALL_PATH)/targets/x86_64-linux/include

to the GPUSPH Makefile.local for good measure, does not change the outcome.

b) CXXFLAGS=-march=skylake -H

The flag -H prints the name of each header file used. The last lines before the long string of error messages are:

…
src/XYZReader.h
src/cuda/boundary_conditions.cu
src/cuda/buildneibs.cu
src/cuda/euler.cu
src/cuda/euler_kernel.def
src/cuda/forces.cu
src/cuda/gamma.cuh
src/cuda/geom_core.cu
src/cuda/neibs_iteration.cuh
src/cuda/post_process.cu
src/cuda/visc.cu
src/debugflags.def
/usr/lib/gcc/x86_64-linux-gnu/6/include/avx512fintrin.h(9218): error: argument of type “const void *” is incompatible with parameter of type “const float *”
…

giuseppe.bilotta · September 18, 2019, 6:14pm

I think approach b) would be the correct one, but the place of inclusion is probably much higher than the last few lines before the closure. Can you get the whole output and put it on a pastebin? make plain=1 -j1 build/problems/CompleteSaExample.o > make.log 2>&1 should capture everything.
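With make.log in hand, a quick tally of which headers the errors come from can help confirm that only the AVX512 intrinsics are involved. A sketch in Python, assuming the diagnostics have the file(line): error: shape seen earlier in this thread (errors_per_file is a made-up helper name):

```python
import re
from collections import Counter

# nvcc front-end diagnostics look like: <file>(<line>): error: <message>
ERROR_LINE = re.compile(r"(\S+)\((\d+)\): (?:catastrophic )?error: ")

def errors_per_file(log_text):
    """Count error diagnostics per source file, keyed by base name."""
    counts = Counter()
    for match in ERROR_LINE.finditer(log_text):
        # Keeping only the base name also copes with build output fused onto the same line
        counts[match.group(1).rsplit("/", 1)[-1]] += 1
    return counts

if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as log:
            for name, n in errors_per_file(log.read()).most_common():
                print(n, name)
```

Saved under a name of your choice, it would be run with the path to make.log as its only argument.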

sph_tudelft_nl · September 19, 2019, 5:48am

CXXFLAGS=-march=skylake -H

Output: https://paste.ubuntu.com/p/Jnhyfwvb8x/

Side note: when turning off Chrono, the compilation passes and the run xfails.

giuseppe.bilotta · September 19, 2019, 7:19am

From the pastebin

............ /mnt/nw-node1/nw/user/s/chrono/4.0.0/include/chrono/core/ChMatrix.h
............. /mnt/nw-node1/nw/user/s/chrono/4.0.0/include/chrono/ChConfig.h
............. /mnt/nw-node1/nw/user/s/chrono/4.0.0/include/chrono/serialization/ChArchiveAsciiDump.h
............. /usr/lib/gcc/x86_64-linux-gnu/6/include/immintrin.h
.............. /usr/lib/gcc/x86_64-linux-gnu/6/include/mmintrin.h
.............. /usr/lib/gcc/x86_64-linux-gnu/6/include/xmmintrin.h
............... /usr/lib/gcc/x86_64-linux-gnu/6/include/mm_malloc.h
................ /usr/include/c++/6/stdlib.h
............... /usr/lib/gcc/x86_64-linux-gnu/6/include/emmintrin.h
................ /usr/lib/gcc/x86_64-linux-gnu/6/include/xmmintrin.h
.............. /usr/lib/gcc/x86_64-linux-gnu/6/include/pmmintrin.h
.............. /usr/lib/gcc/x86_64-linux-gnu/6/include/tmmintrin.h
.............. /usr/lib/gcc/x86_64-linux-gnu/6/include/smmintrin.h
............... /usr/lib/gcc/x86_64-linux-gnu/6/include/popcntintrin.h
.............. /usr/lib/gcc/x86_64-linux-gnu/6/include/wmmintrin.h
.............. /usr/lib/gcc/x86_64-linux-gnu/6/include/avxintrin.h
.............. /usr/lib/gcc/x86_64-linux-gnu/6/include/avx2intrin.h
.............. /usr/lib/gcc/x86_64-linux-gnu/6/include/avx512fintrin.h

So the inclusion of the AVX512 intrinsics is coming in via Chrono. Did you rebuild Chrono with -march=skylake?

sph_tudelft_nl · September 19, 2019, 8:03am

Yes. Namely:

CMAKE_CXX_FLAGS                  -march=skylake 
CMAKE_C_FLAGS                    -march=skylake

but not CUDA_NVCC_FLAGS, I realize, although CUDA_PROPAGATE_HOST_FLAGS is on. I will give it another go. Is it then advisable to set the MPI flags for C++ likewise?

Update. Unfortunately, no progress. The cache file of ccmake can be inspected at https://paste.ubuntu.com/p/5C7rsygr2j/ and the inclusion chain downstream of the Chrono call is this one:

  ...... options/chrono_select.opt
  ...... /mnt/nw-node1/nw/user/s/chrono/4.0.0/include/chrono/physics/ChBody.h
  ....... /usr/include/c++/6/cmath
  ....... /mnt/nw-node1/nw/user/s/chrono/4.0.0/include/chrono/physics/ChBodyFrame.h
  ........ /mnt/nw-node1/nw/user/s/chrono/4.0.0/include/chrono/core/ChFrameMoving.h
  ......... /mnt/nw-node1/nw/user/s/chrono/4.0.0/include/chrono/core/ChFrame.h
  .......... /mnt/nw-node1/nw/user/s/chrono/4.0.0/include/chrono/core/ChCoordsys.h
  .......... /mnt/nw-node1/nw/user/s/chrono/4.0.0/include/chrono/core/ChMatrix33.h
  ........... /mnt/nw-node1/nw/user/s/chrono/4.0.0/include/chrono/core/ChMatrixNM.h
  ............ /mnt/nw-node1/nw/user/s/chrono/4.0.0/include/chrono/core/ChMatrix.h
  ............. /mnt/nw-node1/nw/user/s/chrono/4.0.0/include/chrono/ChConfig.h
  ............. /mnt/nw-node1/nw/user/s/chrono/4.0.0/include/chrono/serialization/ChArchiveAsciiDump.h
  ............. /usr/lib/gcc/x86_64-linux-gnu/6/include/immintrin.h
  .............. /usr/lib/gcc/x86_64-linux-gnu/6/include/mmintrin.h
  .............. /usr/lib/gcc/x86_64-linux-gnu/6/include/xmmintrin.h
  ............... /usr/lib/gcc/x86_64-linux-gnu/6/include/mm_malloc.h
  ................ /usr/include/c++/6/stdlib.h
  ............... /usr/lib/gcc/x86_64-linux-gnu/6/include/emmintrin.h
  ................ /usr/lib/gcc/x86_64-linux-gnu/6/include/xmmintrin.h
  .............. /usr/lib/gcc/x86_64-linux-gnu/6/include/pmmintrin.h
  .............. /usr/lib/gcc/x86_64-linux-gnu/6/include/tmmintrin.h
  .............. /usr/lib/gcc/x86_64-linux-gnu/6/include/smmintrin.h
  ............... /usr/lib/gcc/x86_64-linux-gnu/6/include/popcntintrin.h
  .............. /usr/lib/gcc/x86_64-linux-gnu/6/include/wmmintrin.h
  .............. /usr/lib/gcc/x86_64-linux-gnu/6/include/avxintrin.h
  .............. /usr/lib/gcc/x86_64-linux-gnu/6/include/avx2intrin.h
  .............. /usr/lib/gcc/x86_64-linux-gnu/6/include/avx512fintrin.h
  .............. /usr/lib/gcc/x86_64-linux-gnu/6/include/avx512erintrin.h
  .............. /usr/lib/gcc/x86_64-linux-gnu/6/include/avx512pfintrin.h
  .............. /usr/lib/gcc/x86_64-linux-gnu/6/include/avx512cdintrin.h
  .............. /usr/lib/gcc/x86_64-linux-gnu/6/include/avx512vlintrin.h
  .............. /usr/lib/gcc/x86_64-linux-gnu/6/include/avx512bwintrin.h
  .............. /usr/lib/gcc/x86_64-linux-gnu/6/include/avx512dqintrin.h
  .............. /usr/lib/gcc/x86_64-linux-gnu/6/include/avx512vlbwintrin.h
  .............. /usr/lib/gcc/x86_64-linux-gnu/6/include/avx512vldqintrin.h
  .............. /usr/lib/gcc/x86_64-linux-gnu/6/include/avx512ifmaintrin.h
  .............. /usr/lib/gcc/x86_64-linux-gnu/6/include/avx512ifmavlintrin.h
  .............. /usr/lib/gcc/x86_64-linux-gnu/6/include/avx512vbmiintrin.h
  .............. /usr/lib/gcc/x86_64-linux-gnu/6/include/avx512vbmivlintrin.h
  .............. /usr/lib/gcc/x86_64-linux-gnu/6/include/shaintrin.h
  .............. /usr/lib/gcc/x86_64-linux-gnu/6/include/lzcntintrin.h
  .............. /usr/lib/gcc/x86_64-linux-gnu/6/include/bmiintrin.h
  .............. /usr/lib/gcc/x86_64-linux-gnu/6/include/bmi2intrin.h
  .............. /usr/lib/gcc/x86_64-linux-gnu/6/include/fmaintrin.h
  .............. /usr/lib/gcc/x86_64-linux-gnu/6/include/f16cintrin.h
  .............. /usr/lib/gcc/x86_64-linux-gnu/6/include/rtmintrin.h
  .............. /usr/lib/gcc/x86_64-linux-gnu/6/include/xtestintrin.h

Update. I have placed a post in the Project Chrono forum for good measure.

giuseppe.bilotta · September 20, 2019, 9:15am

Thanks for the debugging effort. I’m completely out of ideas at the moment, so let’s hope the Chrono developers manage to come up with some good ones.

sph_tudelft_nl · November 8, 2019, 2:36pm

I am afraid that, in the meantime, I did not manage to make myself clear on the Project Chrono forum. Therefore, I have moved on to gcc 8.3.0, where this issue has not shown up yet, as indeed indicated by @giuseppe.bilotta.

giuseppe.bilotta · November 9, 2019, 11:41pm

I’ve been thinking about possible alternative solutions for this, but the only one I can think of is to split the problem source files into separate device and pure host parts. The device part would be limited to the SETUP_FRAMEWORK call and the overrides for the problem-specific open boundary conditions etc, and would NOT include Project Chrono headers, while the host part would include everything else (including the interfacing with Chrono). This should at the very least avoid the issue by not having nvcc see the Chrono inclusions and conversely.

This might require some adaptations of the Makefile, so that we have both a SomeProblem.cc for the host part and a SomeProblem.cu for the device part without the object files overwriting each other.