Hello @JoJo,
FORCES_ENQUEUE actually works the other way around: it computes the forces on the edge particles first, and then asynchronously runs the forces computation for the inner particles; this allows the subsequent UPDATE_EXTERNAL to transfer the edge particle data to adjacent devices.
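Roughly, the sequence looks like the sketch below (hypothetical kernel, buffer and function names, two made-up streams — this is not GPUSPH's actual API, just an illustration of the ordering):

```cpp
// Minimal sketch of the ordering described above: edge forces first,
// inner forces asynchronously, then the edge results are shipped to
// the neighboring device.
#include <cuda_runtime.h>

__global__ void forces_kernel(const float4 *pos, float4 *forces,
                              unsigned first, unsigned count)
{
    const unsigned i = first + blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= first + count) return;
    forces[i] = make_float4(0, 0, 0, 0); // placeholder for the real SPH forces
    (void)pos;
}

void forces_enqueue(const float4 *d_pos, float4 *d_forces,
                    unsigned edge_first, unsigned edge_count,
                    unsigned inner_first, unsigned inner_count,
                    cudaStream_t edge_stream, cudaStream_t inner_stream)
{
    const unsigned block = 256;

    // 1. edge particles first, so their forces are ready to be transferred
    forces_kernel<<<(edge_count + block - 1)/block, block, 0, edge_stream>>>(
        d_pos, d_forces, edge_first, edge_count);

    // 2. inner particles enqueued asynchronously on a second stream:
    //    this is the work that hides the transfer latency
    forces_kernel<<<(inner_count + block - 1)/block, block, 0, inner_stream>>>(
        d_pos, d_forces, inner_first, inner_count);
}

void update_external(float4 *d_forces_peer, int peer_device,
                     const float4 *d_forces, int this_device,
                     unsigned edge_first, unsigned edge_count,
                     cudaStream_t edge_stream)
{
    // 3. UPDATE_EXTERNAL: copy the freshly computed edge forces to the
    //    adjacent device, overlapping with step 2 on the other stream
    cudaMemcpyPeerAsync(d_forces_peer + edge_first, peer_device,
                        d_forces + edge_first, this_device,
                        edge_count * sizeof(float4), edge_stream);
}
```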
Indeed, in multi-GPU the striping approach (ENQUEUE + COMPLETE) is usually faster than the synchronous approach, inasmuch as it covers the data transfer latency. However, as you remark, there is no guarantee that this is always the case. The two main factors affecting the efficiency of the striping approach are the size of the slices and the number of bursts needed for the data transfers.
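As a back-of-the-envelope illustration of the latency hiding (with made-up numbers, not measurements):

```cpp
// Toy cost model: edge compute, inner compute and transfer are assumed to be
// the only costs. Numbers are purely illustrative.
#include <algorithm>
#include <cstdio>

int main()
{
    const double t_edge = 2.0;      // forces on edge particles (ms)
    const double t_inner = 10.0;    // forces on inner particles (ms)
    const double t_transfer = 4.0;  // shipping edge data to the neighbor (ms)

    // synchronous: everything is serialized
    const double t_sync = t_edge + t_inner + t_transfer;

    // striping: the transfer runs while the inner forces are being computed,
    // so only the longer of the two contributes
    const double t_striped = t_edge + std::max(t_inner, t_transfer);

    std::printf("sync: %.1f ms, striped: %.1f ms\n", t_sync, t_striped);
    // Even when t_inner < t_transfer, t_striped <= t_sync: partial overlap
    // is still a win over the synchronous approach.
    return 0;
}
```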
Concerning the size of the slices, you want the inner slice (no neighbors on other GPUs) to be “large”. Using one of the standard split methods, the inner edge slice(s) are one cell thick, and you may have up to two of them (one on each side along the split direction); in our experience, you want the inner slice to be at least 3x larger than the edge slices combined (and thus at least 6 cells thick). Of course these are only indicative measures, and they are affected by how particles are distributed through the cells, but in practice this does mean that using “too many GPUs” will actually make the code run slower, because the inner slices get too thin and the transfers start dominating.
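If you want a quick sanity check of whether a given split keeps the inner slice thick enough, something along these lines captures the rule of thumb (the 1-cell edge thickness and the factor 3 come from the discussion above; the code itself is just an illustration, not GPUSPH's logic):

```cpp
#include <cstdio>

// Rough feasibility check: does each device keep an inner slice at least
// 3x as thick as its (up to two) 1-cell edge slices?
bool split_is_reasonable(unsigned cells_along_split_axis, unsigned num_devices)
{
    // cells owned by each device along the split axis (ignoring remainders)
    const unsigned per_device = cells_along_split_axis / num_devices;

    // up to two 1-cell-thick inner edge slices (one per neighboring device)
    const unsigned edge_cells = 2;
    const unsigned inner_cells = per_device > edge_cells ? per_device - edge_cells : 0;

    // rule of thumb: inner slice at least 3x the edge slices, i.e. >= 6 cells
    return inner_cells >= 3 * edge_cells;
}

int main()
{
    // e.g. a 128-cell axis split across 8 vs 32 devices
    std::printf("8 devices:  %s\n", split_is_reasonable(128, 8)  ? "ok" : "too thin");
    std::printf("32 devices: %s\n", split_is_reasonable(128, 32) ? "ok" : "too thin");
    return 0;
}
```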
The second aspect is the number of bursts needed to transfer data. This depends on how the split axis is chosen with respect to the cell linearization. GPUSPH shows the number of bursts needed for data transfers. When the partition is optimal, you get one burst per edge per device. If the number is larger, there are likely benefits to be obtained by using a different cell linearization and/or splitting axis.
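To see why the linearization matters, here is a toy example (again, not GPUSPH's actual logic) counting the bursts for a slice taken along the slowest-varying versus the fastest-varying axis of a simple x-then-y-then-z linearization:

```cpp
// Edge cells that are contiguous in the linearized cell index can be shipped
// in a single burst; a badly aligned split axis fragments them into many bursts.
#include <algorithm>
#include <cstdio>
#include <vector>

// linearize (x, y, z) with x fastest-varying and z slowest-varying
unsigned linearize(unsigned x, unsigned y, unsigned z, unsigned nx, unsigned ny)
{
    return x + nx * (y + ny * z);
}

// count contiguous runs (bursts) in a list of cell indices
unsigned count_bursts(std::vector<unsigned> cells)
{
    std::sort(cells.begin(), cells.end());
    unsigned bursts = cells.empty() ? 0 : 1;
    for (size_t i = 1; i < cells.size(); ++i)
        if (cells[i] != cells[i-1] + 1) ++bursts;
    return bursts;
}

int main()
{
    const unsigned nx = 4, ny = 4, nz = 4;
    std::vector<unsigned> split_z, split_x;

    // edge slice at z = 1 (split along the slowest-varying axis)
    for (unsigned y = 0; y < ny; ++y)
        for (unsigned x = 0; x < nx; ++x)
            split_z.push_back(linearize(x, y, 1, nx, ny));

    // edge slice at x = 1 (split along the fastest-varying axis)
    for (unsigned z = 0; z < nz; ++z)
        for (unsigned y = 0; y < ny; ++y)
            split_x.push_back(linearize(1, y, z, nx, ny));

    std::printf("split along z: %u burst(s)\n", count_bursts(split_z)); // 1
    std::printf("split along x: %u burst(s)\n", count_bursts(split_x)); // 16
    return 0;
}
```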
In general, GPUSPH makes no effort to “guarantee” that the inner forces computation takes longer than the data transfer; this is left entirely to the choices of the user (in particular, there is no load balancing in this version of the code; we used to have some in an older version that didn’t support multi-node). Do keep in mind, however, that even if the inner forces computation isn’t large enough to cover the full data transfer, there are still performance gains over the synchronous approach.
Concerning NVLink, as far as I know the bus is chosen automatically by the driver and cannot be controlled by the developers. NVLink, if present, should be automatically used by peer-to-peer data transfers (which we use, if available). But I do not have access to an NVLink multi-GPU configuration, so I cannot test this directly, I’m afraid.
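If you want to check whether peer-to-peer access is available between the devices on your machine, a plain CUDA runtime query like the following works (this is generic CUDA code, not something from GPUSPH):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);

    for (int a = 0; a < ndev; ++a) {
        for (int b = 0; b < ndev; ++b) {
            if (a == b) continue;
            int can = 0;
            cudaDeviceCanAccessPeer(&can, a, b);
            std::printf("device %d -> device %d: peer access %s\n",
                        a, b, can ? "available" : "NOT available");
            if (can) {
                // enable it so peer copies take the direct path
                cudaSetDevice(a);
                cudaDeviceEnablePeerAccess(b, 0);
            }
        }
    }
    return 0;
}
```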