Some questions about the code

Hi forums,
I have a few questions to ask.

  1. I read the code about neighbour searching, and I think the algorithm is the so-called Verlet list: in this method, a real neighbour list is generated. My question is why a Verlet list and not a cell-linked list (not a real neighbour list, but a list of neighbouring cells with reordered particles)? As far as I know, the CLL is the better choice considering the balance between memory usage and efficiency. Moreover, I want to implement adaptive particle refinement (APR) based on GPUSPH (multi-GPU), and I think in this situation the CLL is better than the VL considering the multiple resolutions.

  2. As I said above, I want to implement APR. I have done it before, but only on a single GPU, so I actually have no experience with multi-GPU, and any suggestion would be wonderful.

  3. There are some state-like variables such as state, policy, dirty. I don't know how they work or what they mean.

Kind regards

Hello @JoJo,

  1. first of all, we are indeed using essentially a Verlet list, i.e. a full list of neighboring particles within the neighbors list search radius (this can be slightly larger than the actual influence radius, and the user has control over how much larger it can be, although we’d like to automatically compute it based on geometric flow features and frequency of neighbors list update).

    The main reason why we use the VL and not e.g. a CLL or some other approach is that we are aiming for maximum performance, and the other approaches result in more expensive neighbors list traversal during computation, whose impact is non-trivial, and scales up with the number of computational kernels that need to traverse the neighbors list (this depends on the configuration chosen by the user with respect to boundary model, rheology, density diffusion, etc.).

    (In fact, with the way we define and index cells, we wouldn’t even need to store the actual CLL in a CLL approach, since the neighboring cells are trivial to compute based on the index of the cell each particle belongs to, so the memory savings would be huge: at most, we’d need a 27-bit map with a 1 for the cells that need traversing and 0 for the known-empty ones. On the other hand, the repeated traversal of cells only sparsely populated by neighbors, such as the ones only diagonally adjacent to the central particle cell, would be quite wasteful compute-wise; see the sketch at the end of this item.)

    There has been recently a significant restructuring of the neighbors list traversal, and one of the things we’ve focused on has been to try and hide the details about it from the kernels (see src/cuda/neibs_iteration.cuh and the for_each_neib and for_every_neib macros defined there, which are used in all computational kernels that do neighbors list traversal). This should make it possible to define new ways to build and traverse the neighbors list without altering each computational kernel. It might actually become necessary to do some redesign here even if keeping the VL, due to a limitation in our current approach (there’s an upper bound in the number of particles per cell due to the way neighbors are encoded in the list). In theory, it would even be possible to provide multiple approaches to neighbors list traversal, and then let the user choose between e.g. a computationally-expensive memory-conserving approach and the current computationally-cheap memory-expensive approach. (As you know, we love having options in GPUSPH :wink: )
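    To make the CLL trade-off mentioned in the parenthetical above concrete, here is a minimal sketch (generic CUDA with made-up names, not the actual GPUSPH data structures) of the kind of on-the-fly cell traversal a CLL approach would use; this is also the kind of alternative that the abstraction in src/cuda/neibs_iteration.cuh would have to hide from the computational kernels:

    ```cpp
    // Illustrative sketch only: cellStart/cellEnd and the per-cell sorted
    // particle layout are hypothetical names, not the actual GPUSPH structures.
    __device__ void traverse_neighbors_cll(const int3 gridPos, const uint3 gridSize,
        const unsigned int *cellStart, const unsigned int *cellEnd)
    {
        // The 27 candidate cells are computed on the fly from the cell the
        // central particle belongs to: no neighbors list needs to be stored.
        for (int dz = -1; dz <= 1; ++dz)
        for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            const int3 c = make_int3(gridPos.x + dx, gridPos.y + dy, gridPos.z + dz);
            if (c.x < 0 || c.y < 0 || c.z < 0 ||
                c.x >= (int)gridSize.x || c.y >= (int)gridSize.y || c.z >= (int)gridSize.z)
                continue; // skip cells outside the domain
            const unsigned int cell = (c.z*gridSize.y + c.y)*gridSize.x + c.x;
            // Every particle in the cell is visited on every traversal, even though
            // many of them (especially in diagonally adjacent cells) are not actual
            // neighbors: this is the extra per-kernel cost mentioned above.
            for (unsigned int j = cellStart[cell]; j < cellEnd[cell]; ++j) {
                // distance check and interaction with particle j would go here
            }
        }
    }
    ```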

  2. concerning APR, I think the key point that we should reason about is how to handle cells, because cells are actually used for three different things in GPUSPH:

    • reduce the neighbors search space (similarly to all other SPH codes): the domain is divided in cells whose side is (no less than) the search radius, and this ensures that neighbors only need to be searched for in the immediately adjacent cells;
    • homogeneous precision for the particle positions and inter-particle distance; we do not store the position of the particles with respect to the origin of the global reference system, but only with respect to the cell the particle belongs to; this avoids numerical issues when working with large domains and fine resolutions (see the sketch after this item);
    • the cell is also the finest granularity for data exchange in multi-GPU; each device is assigned a given set of cells (potentially arbitrary, in practice large consecutive chunks), and data exchange happens between “edge cells”, i.e. cells that belong to one GPU but are adjacent to cells belonging to other GPUs (sometimes also known as the “halo”);

    So the first big question is whether APR would rely on a fixed-side grid or whether even the auxiliary grid will be adaptive: with a fixed-side grid, a lot of the work becomes easy, including multi-GPU, since it can rely on the existing infrastructure, potentially without any change; if the grid is to become adaptive, then this will require a lot of thinking at a fundamental level, including the issue with particle position tracking.
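    Regarding the second point above (cell-relative positions), a minimal sketch of the idea, with made-up names rather than the actual GPUSPH data layout:

    ```cpp
    // Illustration only: not the actual GPUSPH layout, just the concept.
    struct ParticlePos {
        int3   cell;   // index of the cell the particle belongs to
        float3 local;  // position relative to that cell's origin (always small numbers)
    };

    // Relative distance between two particles: the (possibly huge) global coordinates
    // are never formed explicitly, only the integer cell offset gets scaled, so single
    // precision remains accurate even for large domains at fine resolution.
    __host__ __device__ float3 relPos(const ParticlePos a, const ParticlePos b,
        const float3 cellSize)
    {
        return make_float3(
            (a.cell.x - b.cell.x)*cellSize.x + (a.local.x - b.local.x),
            (a.cell.y - b.cell.y)*cellSize.y + (a.local.y - b.local.y),
            (a.cell.z - b.cell.z)*cellSize.z + (a.local.z - b.local.z));
    }
    ```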

  3. the buffer state, policy and validity concern buffer allocation and usage, and are used for debugging the programming logic of the integration schemes:

    • the allocation policy determines how many copies of each buffer are required (usually either 1 or 2 depending on ping-pong usage, but potentially more may be needed for more sophisticated integration schemes);
    • the particle state is used to tag the current state of the buffers; think about labels such as “step n”, “step n*”, “step n+1”;
    • validity holds information about the buffer contents: invalid means that the contents are undefined (e.g. uninitialized), dirty means that it has been updated locally but may need a cross-device update in multi-GPU, and valid means we have the latest information. This has actually been very useful to identify logic issues during the refactoring that the code has undergone between versions 4 and 5, but it is not complete yet (particularly the management of dirty vs valid information in some cases needs some refinement; more generally, we might need some finer granularity in tagging the buffer states).
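    To make the three concepts a bit more concrete, here is a rough sketch of the kind of bookkeeping described above (made-up names and types, not the actual GPUSPH classes):

    ```cpp
    #include <string>

    // Illustration only: not the actual GPUSPH types, just the concept.
    enum class BufferValidity {
        INVALID, // contents undefined (e.g. not initialized yet)
        DIRTY,   // updated locally, may still need a cross-device update in multi-GPU
        VALID    // latest information available
    };

    struct BufferBookkeeping {
        unsigned int num_copies;   // allocation policy: 1 or 2 (ping-pong), possibly
                                   // more for more sophisticated integration schemes
        std::string  state;        // e.g. "step n", "step n*", "step n+1"
        BufferValidity validity;   // used to catch logic errors in the integrator
    };
    ```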

I hope this answers all your questions. I’m open for discussion about the handling of APR and auxiliary grid, so feel free to ask.

I have implemented APR based on a single GPU and a CLL. In my opinion, APR does use a fixed grid, but one per resolution layer, i.e. one layer, one set of grid cells (one could also use a single grid for all layers, but this absolutely decreases the efficiency). No paper has introduced this yet; I am writing a paper about it, but it has not been submitted.

So my concern is this: we can use multiple grids to sort and reorder the particles. Moreover, there is shifting: coupling APR and shifting is not easy, because the support domains might be different, and in this situation I don't know whether the VL can keep its efficiency.

For example, the inactive particles search for neighbouring active particles for the Shepard kernel, while they search for neighbouring particles in the same layer for shifting.

Hello @JoJo,

hm, this is quite challenging. I think the biggest issue in this case isn’t so much how the neighbors list is structured or encoded, but how pervasive the assumption of a single grid is in GPUSPH. As I said, the grid in GPUSPH isn’t used only for neighbors searching, but we’re also leveraging it for uniform spatial accuracy and multi-GPU domain decomposition and data synchronization.

Even assuming we don’t look into multi-GPU at the moment, one thing that still needs to be considered is how to handle the cell-relative position of particles in the multi-grid case.

That's true, I agree with you that it would be challenging. But I believe that relying only on the computational power of GPUs is not enough; it is necessary to find a way to reduce the compute load. (I mainly focus on three-dimensional multi-phase flows.)

Actually I have some ideas, but first a comprehensive understanding of the GPUSPH code is required, so I will probably come here to ask lots of questions.

Some questions here:

1. In the density diffusion term (COLAGROSSI):
I have read the paper Molteni & Colagrossi 2009, but I didn't understand why it is only applied for large density ratios; in my experience, in DualSPHysics it is applied to all fluid particles.

2. What is DYN_BOUNDARY? I cannot find any explanation in the code or the documentation, but I think it is something like the Dynamic Boundary Condition in DualSPHysics, right?
And also, what is MK_BOUNDARY?

3. The VL might be rebuilt every N steps, but what about the communication between devices? In my opinion it happens every step, or even every sub-step (UPLOAD_EXTERNAL always comes right after FORCES_ENQUEUE or FORCES_SYNC, right?)

Thanks for replying

Hello @JoJo,

  1. the Molteni & Colagrossi diffusion term is only applied to same-fluid interactions with large density variations according to the discussion at the end of section 5.2 of the paper, in the paragraph “Switches on the density diffusion correction”, leading to equation (24) of that same paper.
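    Schematically, per particle pair, the switch amounts to something like the following (illustration only, with placeholder names, coefficient and threshold; the precise criterion is the one in equation (24) of the paper, and the actual GPUSPH code should be consulted for the details):

    ```cpp
    // Illustration only: names, the coefficient and the threshold are placeholders,
    // not the actual GPUSPH identifiers nor the exact switch of eq. (24).
    // Conventions: r_ij = pos_i - pos_j, gradW_ij = grad_i W(|r_ij|, h).
    __device__ float density_diffusion_pair(
        const int fluid_num_i, const int fluid_num_j,
        const float rho_i, const float rho_j, const float m_j,
        const float3 r_ij, const float3 gradW_ij,
        const float deltaCoeff, const float h, const float c0,
        const float densityVariationThreshold)
    {
        if (fluid_num_i != fluid_num_j)                         // same-fluid pairs only
            return 0.f;
        if (fabsf(rho_i - rho_j) <= densityVariationThreshold)  // "large" variation switch
            return 0.f;
        const float r2 = r_ij.x*r_ij.x + r_ij.y*r_ij.y + r_ij.z*r_ij.z;
        const float rdotgradW = r_ij.x*gradW_ij.x + r_ij.y*gradW_ij.y + r_ij.z*gradW_ij.z;
        // delta-SPH style diffusive contribution to drho_i/dt: a pairwise
        // discretization of delta*h*c0*Laplacian(rho)
        return deltaCoeff*h*c0 * 2.f*(rho_i - rho_j)*rdotgradW/r2 * (m_j/rho_j);
    }
    ```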

  2. DYN_BOUNDARY refers to the dynamic boundary model (Dalrymple & Knio 2001, Crespo et al., 2007). MK_BOUNDARY refers to the model proposed by Monaghan & Kajtar in 2009, 10.1016/j.cpc.2009.05.008. Beware that the MK model isn’t well-tested.

  3. communication between devices happens multiple times per step, after each “internal-only” compute kernel, i.e. the ones that do not touch the “halo”. As a general rule, if a computation needs the neighbors, then it will be followed by an UPDATE_EXTERNAL, otherwise not. For example, forces computation is followed by an exchange, but integration is not (because integration can be applied to halo particles as well if the forces have been exchanged). The reason for having ENQUEUE and SYNC versions of the FORCES is related to striping, i.e. the possibility to overlap computation and data transfers (this is currently only implemented for the forces computation).
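    As a rough sketch of what this means for a single substep (illustrative pseudocode, not the literal command sequence of the integrator; dispatch and EULER are placeholder names, while FORCES_ENQUEUE, FORCES_SYNC and UPDATE_EXTERNAL are the commands discussed above):

    ```cpp
    // Pseudocode illustration of the per-substep pattern; dispatch() and EULER
    // are placeholders, not the actual GPUSPH scheduler API.
    enum Command { FORCES_ENQUEUE, FORCES_SYNC, UPDATE_EXTERNAL, EULER };
    void dispatch(Command) { /* placeholder: run the command on every device */ }

    void substep()
    {
        dispatch(FORCES_ENQUEUE);  // launch the forces kernel asynchronously, so that
                                   // data transfers can be overlapped with it (striping)
        dispatch(FORCES_SYNC);     // wait for the forces computation to complete

        // forces needed the neighbors, so the halo ("edge cells") must be brought
        // up to date on the other devices before anything reads it:
        dispatch(UPDATE_EXTERNAL); // forces-related buffers

        // integration only touches each particle individually and can be applied to
        // halo particles locally as well, so no exchange is needed afterwards:
        dispatch(EULER);
    }
    ```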