How to read the code

Hi, I want to implement some features based on GPUSPH, and I found it difficult to study the code.

In particular, I cannot find the main loop (the ‘for’ or ‘while’ loop that updates the variables). But I did find ‘RunCommand’, and I think maybe this is the loop: it cyclically executes the CommandSequence defined by the integrator instance. I'm not sure whether I'm right.

Also, is there any information I missed? I mean, the support provided by the guide is quite limited…

Hello @JoJo, you’re absolutely right, the documentation is in dire need of improvements.

While there is a “main loop” in GPUSPH (the while () try { dispatchCommand() } catch {} construct in GPUSPH::runSimulation() in src/GPUSPH.cc), this is not where the logic of the main loop actually lives.
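To make the shape of that loop concrete, here is a minimal self-contained sketch of the pattern (a toy model, not the actual GPUSPH code: the Command values and the keep_going flag are made up for illustration):

```cpp
#include <iostream>
#include <queue>
#include <stdexcept>

// Toy model of the dispatch loop: the integrator hands out commands,
// the loop only dispatches them one at a time and handles errors.
enum class Command { FORCES, EULER, QUIT };

int main()
{
    // stand-in for the command sequence produced by the integrator
    std::queue<Command> sequence({Command::FORCES, Command::EULER, Command::QUIT});

    bool keep_going = true;
    while (keep_going) try {
        Command cmd = sequence.front();
        sequence.pop();
        switch (cmd) {
        case Command::FORCES: std::cout << "dispatch: forces\n"; break;
        case Command::EULER:  std::cout << "dispatch: euler\n";  break;
        case Command::QUIT:   keep_going = false;                break;
        }
    } catch (std::exception const& e) {
        // any error aborts the loop cleanly
        std::cerr << "error: " << e.what() << '\n';
        break;
    }
    return 0;
}
```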

The reason the logic is not kept there is that, depending on the simulation setup and integration scheme, the steps that need to be executed can be very different, and having all the corresponding conditionals embedded in the main loop made things extremely fragile.

The sequence of commands is now defined once by the chosen Integrator, which for normal simulations is the PredictorCorrectorIntegrator. The code common to all integrators lives in src/Integrator.h and src/Integrator.cc, while the integrator-specific code is in the corresponding files under src/integrators (e.g. for the standard predictor/corrector, PredictorCorrectorIntegrator.cc and its header).

Integrators collect sequences of commands, grouping them into “phases”. The logic to transition from one phase to the next is in the next_phase() method of the integrator, while the sequence of commands that makes up each phase is defined in the corresponding initializePhase specialization.
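As a rough illustration of the idea (a toy sketch with made-up names; only next_phase() and initializePhase come from the description above, and both are heavily simplified here):

```cpp
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Toy sketch of the Integrator idea: each phase is a fixed list of commands,
// and next_phase() carries the logic for moving from one phase to the next.
struct Phase {
    std::string name;
    std::vector<std::string> commands; // in GPUSPH these are command objects, not strings
};

class ToyPredictorCorrector {
    std::vector<Phase> phases;
    std::size_t current = 0;
public:
    ToyPredictorCorrector() {
        // in GPUSPH each phase would be filled in by an initializePhase specialization
        phases.push_back({"predictor", {"forces", "integrate (half step)"}});
        phases.push_back({"corrector", {"forces", "integrate (full step)"}});
    }
    Phase const& current_phase() const { return phases[current]; }
    // transition logic: here simply alternate predictor and corrector
    void next_phase() { current = (current + 1) % phases.size(); }
};

int main()
{
    ToyPredictorCorrector integ;
    for (int step = 0; step < 4; ++step) {
        std::cout << integ.current_phase().name << '\n';
        integ.next_phase(); // alternate predictor / corrector
    }
    return 0;
}
```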

Here’s a diagram of the current predictor/corrector integration scheme (you can open the image and view it in detail):

The actual commands defined by the integrator are then read one after the other in the loop in GPUSPH::runSimulation() and dispatched to the appropriate component: some commands run on the host, and are implemented as runCommand specializations in GPUSPH itself, while others run on the device, and are implemented as runCommand specializations in GPUWorker (there is one worker running for every GPU used in the simulation).
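Schematically, the host/device split works along these lines (again a toy sketch, not the real dispatch code; the Where enum and the two-worker setup are invented for illustration):

```cpp
#include <iostream>
#include <vector>

// Toy sketch of the dispatch split: host-side commands are handled by the
// GPUSPH object itself, device-side commands by every GPUWorker (one per GPU).
enum class Where { HOST, DEVICE };
struct Command { const char* name; Where where; };

struct ToyWorker {
    int gpu;
    void runCommand(Command const& c) const { std::cout << "GPU " << gpu << ": " << c.name << '\n'; }
};

struct ToyGPUSPH {
    std::vector<ToyWorker> workers{{0}, {1}}; // pretend we run on two GPUs
    void runCommand(Command const& c) const { std::cout << "host: " << c.name << '\n'; }
    void dispatchCommand(Command const& c) const {
        if (c.where == Where::HOST)
            runCommand(c);               // host-side implementation
        else
            for (auto const& w : workers)
                w.runCommand(c);         // device-side implementation, once per worker
    }
};

int main()
{
    ToyGPUSPH sim;
    sim.dispatchCommand({"write output", Where::HOST});
    sim.dispatchCommand({"forces", Where::DEVICE});
    return 0;
}
```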

Most of the worker commands end up calling CUDA kernels, which are defined in the .cu files under src/cuda, with different specializations based on the static simulation configuration (“framework options” such as the choice of smoothing kernel or boundary model).
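To give a flavor of what such compile-time specialization looks like (a generic CUDA sketch, not GPUSPH's actual kernels; the kernel names and the unnormalized weight formulas are only illustrative):

```cpp
// Generic CUDA sketch of specialization on a compile-time "framework option":
// here the smoothing kernel choice is a template parameter, so each choice
// compiles into its own device code.
enum KernelType { CUBICSPLINE, WENDLAND };

template<KernelType kt>
__device__ float smoothing_weight(float q);

// unnormalized cubic spline, q = r/h
template<>
__device__ float smoothing_weight<CUBICSPLINE>(float q)
{
    if (q < 1.0f) return 1.0f - 1.5f*q*q + 0.75f*q*q*q;
    if (q < 2.0f) { float t = 2.0f - q; return 0.25f*t*t*t; }
    return 0.0f;
}

// unnormalized Wendland C2, q = r/h
template<>
__device__ float smoothing_weight<WENDLAND>(float q)
{
    float t = fmaxf(0.0f, 1.0f - 0.5f*q);
    return t*t*t*t*(1.0f + 2.0f*q);
}

template<KernelType kt>
__global__ void weightsToy(const float* q, float* w, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        w[i] = smoothing_weight<kt>(q[i]);
}

// host side: the static configuration picks which instantiation gets launched
void launch_weights(const float* d_q, float* d_w, int n, bool use_wendland)
{
    const int block = 128, grid = (n + block - 1)/block;
    if (use_wendland)
        weightsToy<WENDLAND><<<grid, block>>>(d_q, d_w, n);
    else
        weightsToy<CUBICSPLINE><<<grid, block>>>(d_q, d_w, n);
}
```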

I hope this is enough to get you started. Feel free to ask more if needed.

Thanks. Actually, I am a user of DualSPHysics, but their code does not support multi-device or multi-node operation, so I'm here. I have to say that your code is much more difficult than DualSPHysics, but highly extensible.

Also, I found some macros like BLOCK_SIZE_FMAX in forces.cu that are used to launch CUDA kernel functions. All of these macros are fixed numbers: are these numbers optimal for any device (RTX, TITAN, TESLA), or are they experimental?

It is hard for me to understand your code right now, especially runSimulation and runCommand, and I will ask for more support next time.

> Thanks. Actually, I am a user of DualSPHysics, but their code does not support multi-device or multi-node operation, so I'm here. I have to say that your code is much more difficult than DualSPHysics, but highly extensible.

Indeed, there’s a lot of complexity related to multi-GPU and multi-node support, as well as to the large variety of simulation options that we support. Much of the refactoring done between version 4 and version 5 was in fact aimed at allowing extensions to be implemented more robustly (e.g. catching logic errors such as the use of buffers before they have been properly updated across devices).

> Also, I found some macros like BLOCK_SIZE_FMAX in forces.cu that are used to launch CUDA kernel functions. All of these macros are fixed numbers: are these numbers optimal for any device (RTX, TITAN, TESLA), or are they experimental?

Mostly, the block sizes have been tuned for specific architectures (e.g. Kepler vs Maxwell). The optimization effort has focused mostly on the forces and euler kernels, and it periodically needs to be refreshed. We're actually looking into developing some kind of more automated tuning process, but that's still far down the road. If you find that a different block size works better for your architecture, do let us know.
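For reference, the usual role of such a macro is simply to fix the block dimension in the launch configuration, along these lines (a generic illustration with a made-up kernel and value, not the actual forces.cu code):

```cpp
// Generic illustration of a fixed block-size macro: the grid size is derived
// from the problem size at run time, while the block size is a tuned constant
// (which is why its optimal value can differ between GPU architectures).
#define BLOCK_SIZE_FMAX 128 // illustrative value, not the one in GPUSPH

__global__ void fmaxToy(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = fmaxf(in[i], 0.0f);
}

void launch_fmax(const float* d_in, float* d_out, int n)
{
    // round-up division: enough blocks of BLOCK_SIZE_FMAX threads to cover n
    const int numBlocks = (n + BLOCK_SIZE_FMAX - 1) / BLOCK_SIZE_FMAX;
    fmaxToy<<<numBlocks, BLOCK_SIZE_FMAX>>>(d_in, d_out, n);
}
```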

> It is hard for me to understand your code right now, especially runSimulation and runCommand, and I will ask for more support next time.

Sure, ask and we’ll provide more details. We can also take this opportunity to collect more material to improve the documentation :sunglasses: