Hello @JoJo,
to ensure that all the neighbors of each particle are found within the Moore neighborhood of the particle's cell, the cell side must be no less than the search radius.
(The search radius itself must be no less than the smoothing kernel compact support radius kh, but it can be larger; in GPUSPH this can be tuned by changing the neighbors search expansion parameter. Using a search radius larger than the kernel support radius allows you to rebuild the neighbors list less frequently without losing interactions.)
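The trade-off behind the expansion parameter follows the standard Verlet-list reasoning: a pair currently farther apart than the search radius S can come within kernel range kh only after the two particles approach each other by more than S − kh. Here's a quick sketch of that bound (plain Python for illustration; the numbers and the helper function are made up, and this is not how GPUSPH actually schedules rebuilds):

```python
def rebuild_interval(kh, eps, vmax):
    """Lower bound on the number of steps the neighbors list stays valid.

    kh:   kernel compact support radius
    eps:  neighbors search expansion (S = (1 + eps) * kh)
    vmax: max displacement of a particle per time step
    """
    S = (1.0 + eps) * kh
    # two particles can close their mutual distance by at most 2*vmax per
    # step, so no interaction can be missed for at least this many steps
    return (S - kh) / (2.0 * vmax)

# hypothetical numbers, purely for illustration
steps = rebuild_interval(kh=0.1, eps=0.2, vmax=0.001)
assert abs(steps - 10.0) < 1e-9  # about 10 steps between rebuilds
```

A larger expansion means fewer rebuilds, at the cost of a longer neighbors list (more candidate pairs per interaction step).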
Note that the cell side can be larger than the search radius: the condition that all neighbors are found in the Moore neighborhood of the cell is still satisfied. It fails only when the cell side is smaller than the search radius.
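This can be checked numerically with a quick brute-force sketch (plain Python, purely for illustration, not GPUSPH code): when the cell side D is no less than the search radius S, any two particles within distance S of each other land in cells whose indices differ by at most 1 along each axis, i.e. one is in the Moore neighborhood of the other's cell.

```python
import math
import random

def cell_index(x, D):
    # 1D is enough: each axis is handled independently
    return math.floor(x / D)

S = 3.0   # search radius
D = 3.2   # cell side, chosen no less than S

random.seed(42)
for _ in range(100_000):
    x1 = random.uniform(0.0, 100.0)
    # second particle within the search radius of the first
    x2 = x1 + random.uniform(-S, S)
    # |x1 - x2| <= S <= D guarantees the cell indices differ by at most 1
    assert abs(cell_index(x1, D) - cell_index(x2, D)) <= 1
```

With D < S the assertion would eventually fail: a pair within the search radius can then straddle two non-adjacent cells.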
Now, to ensure that an integer number of cells spans the world size W, you want the number of cells C to be no more than W/S. We choose C = floor(W/S) <= W/S, since this guarantees that the cell side D satisfies D = W/C >= W/(W/S) = S.
Consider as an example the case W = 70 and S = 3; then W/S = 23.(3) and C = 23, so D = W/C = 70/23 = 3.0434… > 3: the cell side is no less than the search radius and the Moore neighborhood condition is satisfied. If instead we had picked the ceiling, the number of cells would have been 24 and the cell side W/C = 70/24 = 2.91(6) < 3: smaller than the search radius, with the potential to miss some neighbors.
(Note that in general in SPH missing a few far-away neighbors is not a big issue, since their contribution is small; but technically any code that uses the ceiling function to choose the number of cells is wrong.)
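The floor-vs-ceiling choice above is easy to verify numerically; a minimal sketch (plain Python, not GPUSPH code):

```python
import math

def cells_and_side(W, S, use_floor=True):
    """Number of cells C and resulting cell side D = W/C
    for world size W and search radius S."""
    ratio = W / S
    C = math.floor(ratio) if use_floor else math.ceil(ratio)
    return C, W / C

W, S = 70.0, 3.0

C, D = cells_and_side(W, S, use_floor=True)
# floor: C = 23, D = 70/23 = 3.0434… >= S, Moore condition holds
assert C == 23 and D >= S

C, D = cells_and_side(W, S, use_floor=False)
# ceiling: C = 24, D = 70/24 = 2.91(6) < S, neighbors can be missed
assert C == 24 and D < S
```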
In general the discrepancy between the cell side and the search radius is small, but at low resolutions and/or in small domains it can become large-ish.
Concerning the use of OpenMP to parallelize the host code, this is actually something we've been pondering ourselves. You can find some comments, e.g. in doWrite, about parallelizing some of the work that is run on the host. We've never actually done it because it's a very low-priority task compared to many other things we're working on.
For what it's worth, I don't think using OpenMP to parallelize some host tasks would lead to significant conflicts with the worker threads: their load is usually relatively small anyway (mostly busy-waiting for the GPU kernels to finish). At worst, I'd expect that under heavy load the next command in the chain may suffer from slightly higher latency, which is something we haven't optimized for yet anyway.
(For the future, please open a separate thread to ask unrelated questions, it helps keep the discussions organized ;-))