Question about fillDevice

Hi, I am preparing a large-scale test case, but I have some questions about splitting the domain.
It is difficult to test this by myself, because such a test case needs a supercomputer, so I would like to ask the contributors.

After reading the code, I noticed that the communication between devices seems to require the particle array to be reordered so that the cell indices are contiguous. Moreover, the communications are executed only once per step (I mean, in a single burst) between any two given devices. However, in the situation below, the cell indices are definitely not contiguous.

What I notice is that if one splits the domain with the function fillDeviceMapByAxis into consecutive slices, or with fillDeviceMapByAxesSplit, it may cause the aforementioned situation.
So maybe I am missing some information.
I would like to ask whether the aforementioned situation is a problem and, if not, how you deal with it.

Hello @JoJo,

the “halo” cells (those that need to be exchanged between devices) do not need to be consecutive in memory. If they are not, GPUSPH will split the transfer in multiple bursts (as many as necessary). On startup, it will tell you how many bursts will be necessary for each bordering device-device pair. The only downside of having multiple bursts is that this increases the latency cost of inter-device communication, which will slow down the simulation and reduce the scaling. Enabling striping helps a lot with that.

There are mainly two variables that contribute to the number of bursts: the way cells are distributed across devices on one side, and the way linearization is done on the other. You can change the linearization with make linearization=yzx, where yzx can be in any order (e.g. you could choose xyz or zxy). Which linearization is best depends on the problem layout and device distribution, but in general you want to pick one that minimizes the number of bursts.
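As an illustration of the interplay between split and linearization, here is a standalone sketch (not GPUSPH code; the grid size, the split and the axis-order convention are all made up for the example) that linearizes the halo cells of a single-axis split under two different axis orders and counts the contiguous runs, i.e. the bursts:

```cpp
// Standalone illustration (not GPUSPH code): count how many contiguous
// runs ("bursts") the halo cells of one device boundary form under a
// given cell linearization. Grid size, split and axis-order convention
// are made up for this example.
#include <algorithm>
#include <array>
#include <cstdio>
#include <vector>

// Linearize cell coordinates (x,y,z) on an nx*ny*nz grid. In this
// sketch's convention, order[0] is the fastest-varying axis
// (0 = x, 1 = y, 2 = z).
size_t linearize(std::array<int,3> c, std::array<int,3> n, std::array<int,3> order)
{
    size_t idx = 0, stride = 1;
    for (int a : order) {
        idx += size_t(c[a]) * stride;
        stride *= size_t(n[a]);
    }
    return idx;
}

// Number of contiguous runs in a set of linearized cell indices.
int count_bursts(std::vector<size_t> cells)
{
    std::sort(cells.begin(), cells.end());
    int bursts = cells.empty() ? 0 : 1;
    for (size_t i = 1; i < cells.size(); ++i)
        if (cells[i] != cells[i-1] + 1) ++bursts;
    return bursts;
}

int main()
{
    const std::array<int,3> n = {64, 64, 64}; // cells per axis (made up)

    // Halo of a split along x: the plane of cells with x == 31.
    std::vector<std::array<int,3>> halo;
    for (int z = 0; z < n[2]; ++z)
        for (int y = 0; y < n[1]; ++y)
            halo.push_back({31, y, z});

    const std::array<int,3> orders[] = {
        {0, 1, 2},  // x fastest: the plane is scattered into 64*64 runs
        {1, 2, 0},  // x slowest: the plane is one contiguous block
    };
    for (const auto& order : orders) {
        std::vector<size_t> lin;
        for (const auto& c : halo)
            lin.push_back(linearize(c, n, order));
        printf("axis order (%d,%d,%d): %d burst(s)\n",
               order[0], order[1], order[2], count_bursts(lin));
    }
    return 0;
}
```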

Right, I know how linearization works. However, I think that if you split the domain into Xslice × Yslice × Zslice slices (with X, Y, Zslice != 1), you still have to split the transfers into multiple bursts. On the other hand, if you split the domain along just one axis, then you only need one burst, using the right linearization. If I am wrong, please point it out.

Nice to hear that my worry is not a problem.

Exactly: if you split along a single axis, then there is a linearization that will allow all transfers to be collected into a single burst, but if you have a more complex split (e.g. a 2D split) then some of the device boundaries will require multiple bursts.

As I mentioned, the only downside with multiple bursts is that data transfers will take more time, so if you don’t have enough workload with the inner particles, the kernels might not be able to fully cover the async transfers, and you will see a decrease in performance/scaling as the number of GPUs grows.

Exactly, the performance depends on how you split the domain and on the ratio Ninner / Nedge.
As a result, I think it is difficult to define the speed-up on multi-GPU, i.e. only a large-scale simulation can show the capability of multi-host / multi-GPU. However, such a simulation cannot be run on a single GPU :laughing:

Hello @JoJo,

yes and no.

It’s true that strong scaling (i.e. adding more GPUs at constant problem size) has a relatively low upper bound. This is true for everything (per Amdahl’s law). In our case, in our experience, you can maybe get to 4 GPUs if your problem size is just large enough to completely fill one (high-end, with the most RAM) GPU, and in this case you can usually do a one-axis split. Higher GPU numbers are usually more convenient when you’re doing weak scaling, i.e. increasing both the number of GPUs and the problem size.
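Just to make the Amdahl’s-law bound explicit: with a serial (non-parallelizable) fraction s of the work, the speed-up on N devices is bounded by 1 / (s + (1 − s)/N). The 10% serial fraction below is only a made-up illustrative value, not a GPUSPH measurement:

```cpp
// Amdahl's law: with a serial (non-parallelizable) fraction s of the
// work, the speed-up on N devices is bounded by 1 / (s + (1 - s) / N).
#include <cstdio>

int main()
{
    const double s = 0.10; // assumed serial fraction, illustrative only
    for (int n : {1, 2, 4, 8, 16})
        printf("N = %2d  max speed-up = %.2f\n", n, 1.0 / (s + (1.0 - s) / n));
    // Even for N -> infinity the speed-up cannot exceed 1/s = 10x here.
    return 0;
}
```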

You can still measure the speed-up that you get in the weak scaling case, though, by measuring the effective performance you manage to obtain: for example, if you get 20 MIPPS on a single GPU, and get 38 MIPPS on two GPUs, you have 95% efficient scaling. With multi-node it’s a bit harder to estimate since each node computes its own MIPPS independently, but in this case the total number of MIPPS is the sum. So for example if you have 4 nodes with 2 GPUs each, and you have 35 MIPPS on each node (lower than the 38 MIPPS you would get without multi-node), you would still get good scaling (140 MIPPS on 8 GPUs is about 87.5% efficiency).
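For reference, the efficiency figures above compare the measured aggregate MIPPS against the ideal value of N times the single-GPU performance; a quick way to reproduce them (plain arithmetic, no GPUSPH code involved):

```cpp
// Plain arithmetic reproducing the efficiency figures above:
// efficiency = measured aggregate MIPPS / (N_GPUs * single-GPU MIPPS).
#include <cstdio>

int main()
{
    const double single_gpu = 20.0;   // MIPPS on 1 GPU
    const double two_gpus   = 38.0;   // MIPPS on 2 GPUs
    printf("2 GPUs: %.1f%% efficiency\n",
           100.0 * two_gpus / (2 * single_gpu));   // 95.0%

    const double per_node = 35.0;     // MIPPS per node, 4 nodes x 2 GPUs
    const double total    = 4 * per_node;          // 140 MIPPS aggregate
    printf("8 GPUs: %.1f%% efficiency\n",
           100.0 * total / (8 * single_gpu));      // 87.5%
    return 0;
}
```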

Thanks for the reply. Some of these terms I have heard for the first time. :joy:
After I implemented multi-resolution, shifting, and the density diffusion term (ANTUONO) on top of GPUSPH (master), I found that the speed-up is lower than in the official GPUSPH (but I think it is not bad). I believe this is because more communication is required (I also set buildneibsfreq to 1). However, it depends on the problem size: the larger the problem, the higher the speed-up.

Hello @JoJo,

it’s normal to have a first implementation that is correct even if it doesn’t scale well.

The biggest obstacle to scaling is the data exchange. Aside from minimizing the exchange of data, the only way to reduce its impact is to split kernel calls so that they first run on the halo, and then the rest runs on the inner particles while the halo data is exchanged.
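A minimal host-side sketch of that ordering is given below, assuming hypothetical helpers compute_forces and exchange_halo and a particle layout with the inner particles first and the halo at the end; it only illustrates the pattern with two CUDA streams and is not the actual GPUSPH implementation:

```cpp
// Hypothetical sketch of the "halo first, then inner" ordering with two
// CUDA streams. compute_forces and exchange_halo are placeholders (with
// empty stub bodies so the sketch compiles), NOT the GPUSPH API; the
// particle layout (inner particles first, halo at the end) is also just
// an assumption of this example.
#include <cuda_runtime.h>
#include <cstddef>

// Stand-in for the force computation over particles [first, first+count).
void compute_forces(const float* pos, float* forces,
                    size_t first, size_t count, cudaStream_t stream)
{ /* real code would launch the FORCES kernel here */ }

// Stand-in for the asynchronous device-to-device halo exchange
// (e.g. one cudaMemcpyPeerAsync per burst).
void exchange_halo(float* forces, size_t first, size_t count, cudaStream_t stream)
{ /* real code would enqueue the transfers here */ }

void striped_forces_step(const float* pos, float* forces,
                         size_t n_inner, size_t n_halo)
{
    cudaStream_t compute, transfer;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&transfer);

    // 1. Compute forces on the halo particles first...
    compute_forces(pos, forces, n_inner, n_halo, compute);
    cudaStreamSynchronize(compute);

    // 2. ...then start exchanging their results asynchronously...
    exchange_halo(forces, n_inner, n_halo, transfer);

    // 3. ...while the bulk of the work (the inner particles) runs
    //    concurrently and hides the transfer latency.
    compute_forces(pos, forces, 0, n_inner, compute);

    // 4. Both must have finished before the integrator uses the forces.
    cudaStreamSynchronize(transfer);
    cudaStreamSynchronize(compute);

    cudaStreamDestroy(compute);
    cudaStreamDestroy(transfer);
}
```

The point is simply that the halo results are computed and put in flight as early as possible, so that the work on the inner particles has the best chance of covering the transfer time.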

Even in the released version of GPUSPH, this is currently only done for the FORCES kernel, since that is where most of the computational work happens. We’re looking into porting other kernels to a similar mechanism, which can be quite important for some formulations, but we would like to do this in a way that is relatively easy to implement (ideally, this should then be applied to all kernels that run before an UPDATE_EXTERNAL).

And yes, rebuilding the neighbors list at every timestep will definitely not help with the scaling, since there’s a global synchronization point there, due to the need to update the information about which particles are in the halo and where they are.

(BTW, I’m really looking forward to the changes you’ve implemented. I hope you’ll find some time to clean up their implementation and submit them for merging into next.)