Here is my understanding:
FORCES_SYNC: first we finish the data transfer, then we compute the forces of all particles on the current device (including the particles belonging to INNER_EDGE).
FORCES_ENQUEUE: the forces of the INNER particles (whose neighbors are all on the current device) are computed first; this stage is the so-called 1st stripe. At the same time, another stream executes the data transfer. At the end we compute the forces of the INNER_EDGE particles (the 2nd stripe).

Am I right? Please point out any mistakes.

Anyway, FORCES_ENQUEUE is clearly faster than FORCES_SYNC because it hides the data-transfer latency. However, I want to ask: how can we ensure that the 1st stripe takes more time than the data transfer? I ask because, if this were ensured, we could begin executing the 2nd stripe immediately after the 1st stripe, without any latency. Is there an empirical formula?

And if we consider NVLink, how can we make good use of its high bandwidth? Should we modify the code? (Sorry, I don’t know whether NVLink has its own primitives.)

Hello @JoJo,

FORCES_ENQUEUE actually works the other way around: it computes edge particles first, and then asynchronously runs the forces computation for inner particles; this allows the subsequent UPDATE_EXTERNAL to transfer edge particle data to adjacent devices.

Indeed, in multi-GPU the striping approach (ENQUEUE + COMPLETE) is usually faster than the synchronous approach, since it hides the data-transfer latency. However, as you remark, there is no guarantee that this is always the case. The two main things affecting the efficiency of the striping approach are the size of the slices and the number of bursts needed for data transfers.

Concerning the size of the slices, you want the inner (no neighbors on other GPUs) slice to be “large”. Using one of the standard split methods, the inner edge slice(s) are one cell thick, and you may have up to two of them (one on each side along the split direction); in our experience, you want the inner slice to be at least 3x larger (and thus at least 6 cells thick). Of course these are only indicative measures, and they are affected by how the particles are distributed through the cells, but in practice this does mean that using “too many GPUs” will actually make the code run slower, because the inner slices get too thin and transfers start dominating.
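To make the heuristic concrete, here is a toy back-of-the-envelope check (my own sketch, not GPUSPH code; the function names are made up), assuming an even plane split along one axis and one-cell-thick edge slices:

```python
def inner_slice_cells(cells_along_split, n_devices):
    """Thickness (in cells) of the inner slice on an interior device,
    assuming an even split along one axis, with a one-cell-thick inner
    edge slice on each side that borders another device (n_devices >= 2)."""
    per_device = cells_along_split // n_devices
    n_edge_slices = 2 if n_devices > 2 else 1  # end devices have one edge
    return per_device - n_edge_slices

def split_is_reasonable(cells_along_split, n_devices):
    # heuristic from above: inner slice at least 3x the (up to) two
    # one-cell edge slices, i.e. at least 6 cells thick
    return inner_slice_cells(cells_along_split, n_devices) >= 6

# a 64-cell axis: 4 devices leave a 14-cell inner slice per interior
# device, while 12 devices leave only 3 cells -> transfers start dominating
print(inner_slice_cells(64, 4), inner_slice_cells(64, 12))  # 14 3
```

Again, this is only indicative: the actual balance depends on how many particles the cells contain, not just on cell counts.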

The second aspect is the number of bursts needed to transfer data. This depends on how the split axis is chosen with respect to the cell linearization. GPUSPH shows the number of bursts needed for data transfers. When the partition is optimal, you get one burst per edge per device. If the number is larger, there are likely benefits to be obtained by using a different cell linearization and/or splitting axis.
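As a toy illustration of how the split axis interacts with the cell linearization (this is not GPUSPH code; I assume a plain row-major linearization here): a one-cell-thick edge slice is contiguous in memory, and hence a single burst, only when the split is along the slowest-varying axis.

```python
def linear_index(ix, iy, iz, nx, ny, nz):
    # row-major linearization: x fastest-varying, z slowest-varying
    return ix + nx * (iy + ny * iz)

def count_bursts(indices):
    """Number of maximal runs of consecutive linear indices: each run
    can be transferred as a single contiguous burst."""
    s = sorted(indices)
    return 1 + sum(1 for a, b in zip(s, s[1:]) if b != a + 1)

nx, ny, nz = 8, 8, 8
# one-cell-thick edge slice perpendicular to z (the slowest axis):
z_slice = [linear_index(x, y, 3, nx, ny, nz) for x in range(nx) for y in range(ny)]
# the same kind of slice perpendicular to x (the fastest axis):
x_slice = [linear_index(3, y, z, nx, ny, nz) for y in range(ny) for z in range(nz)]
print(count_bursts(z_slice), count_bursts(x_slice))  # 1 64
```

With the “wrong” axis the same amount of data fragments into one burst per cell row, which is exactly the symptom to look for in the burst count GPUSPH reports.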

In general GPUSPH makes no effort to “guarantee” that the inner forces computation takes longer than the data transfer; this is left entirely to the choices of the user (in particular, there is no load balancing in this version of the code; we used to have some in an older version that didn’t support multi-node). Do keep in mind, however, that even if the inner forces computation isn’t large enough to cover the full data transfer, there are still performance gains over the sync approach.
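A crude timing model (my own sketch, not anything GPUSPH computes; launch overheads and burst effects are ignored) makes the last point concrete: even when the transfer outlasts the inner computation, the striped path only waits for the difference.

```python
def sync_time(t_edge, t_inner, t_transfer):
    # synchronous path: wait for the transfer, then compute everything
    return t_transfer + t_edge + t_inner

def striped_time(t_edge, t_inner, t_transfer):
    # ENQUEUE/COMPLETE path: edge forces first, then the edge transfer
    # overlaps with the inner forces computation
    return t_edge + max(t_inner, t_transfer)

print(sync_time(1, 10, 4), striped_time(1, 10, 4))  # 15 11 (transfer fully hidden)
print(sync_time(1, 3, 4), striped_time(1, 3, 4))    # 8 5 (partial overlap, still a gain)
```

In this model the striped time can never exceed the sync time, since `t_edge + max(t_inner, t_transfer) <= t_edge + t_inner + t_transfer`.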

Concerning NVLink: as far as I know the bus is chosen automatically by the driver and cannot be controlled by the developers. NVLink, if present, should be used automatically by peer-to-peer data transfers (which we use, if available). But I do not have any NVLink multi-GPU configuration, so I cannot test this directly, I’m afraid.

You reminded me: I felt something was strange when I read the comments in FORCES_ENQUEUE. I read them again and yes, I was wrong: edge particles are computed first. But I think that when computing the edge particles we need data from the outer edge particles, which belong to other devices; the inner particles, on the contrary, don’t need that. So, based on this trait, it seems we could compute the inner particles first to run faster. In that case, I think we would only have to ensure that the numbers of inner particles on the devices are as close as possible, so we would need dynamic load balancing. In cases like a dambreak, dynamic load balancing would be very important… Anyway, I would like to ask why the edge particles are computed first.

Nice to hear that the old version had load balancing; how can I get that version?

Now I am trying to implement the adaptive particle refinement proposed by Chiron (https://www.sciencedirect.com/science/article/pii/S0021999117308082) on top of GPUSPH, because some large-scale cases are almost impossible to simulate with a uniform particle spacing, even with MULTI_NODE and MULTI_GPU. I am working on fillDevicesBy… now; very challenging.

Hello @JoJo,

in multi-device mode, the sequence is kernel, update, kernel, update, kernel, update, etc. Conceptually, you can see this in two ways: either a sequence of (kernel, update), or a sequence of (update, kernel).

In GPUSPH we consider each update as the “completion” of the previous kernel, so the sub-step is not complete until the new data has been transferred to the other devices. When forces (either sync or async) is invoked, all edge data has already been copied across devices, so the edge particles can be computed. We then cover the latency of the following update (i.e. of the arrays updated by forces) by computing the inner particles while the edge data is being transferred.
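To see why edge-first is the right order under this view, here is a toy timing comparison (my own sketch; the function names and numbers are made up): since the incoming edge data is already present when forces starts, the only overlap to exploit is between the outgoing edge transfer and the inner computation, and that overlap only exists if the edge results are produced first.

```python
def edge_first(t_edge, t_inner, t_transfer):
    # GPUSPH order: edge forces first, then the outgoing edge-data
    # transfer runs concurrently with the inner forces computation
    return t_edge + max(t_inner, t_transfer)

def inner_first(t_edge, t_inner, t_transfer):
    # reversed order: edge results only exist at the very end of the
    # kernel, so the following transfer cannot overlap with anything
    return t_inner + t_edge + t_transfer

print(edge_first(1, 10, 4))   # 11
print(inner_first(1, 10, 4))  # 15
```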

Concerning the version with load-balancing, I’m afraid it’s not publicly available. The design of the code was completely redone to support multi-node anyway (including support for arbitrary split and cell distribution) so the logic wouldn’t be compatible with the new structure.

Thanks for the bibliographic reference, I’ll give it a read. If you have any particular issues that need addressing, let us know.