3.7 billion elements in the neighbors list corresponds to roughly 30M particles with the standard neighbors list size (128 neighbors per particle), so that's reasonable. (Yes, the neighbors list is the largest array in GPUSPH.) It's probably running out of memory due to some allocation alignment requirement, or something similar that makes the last array just barely fail to fit.
Unrelated to your specific issue, but the logs show that you built for Compute Capability 5.0, while the GPUs you're running the simulation on have Compute Capability 7.5. You should recompile GPUSPH for the correct compute capability (`make compute=75`). Also, it seems the GPUs cannot peer, so data transfers between devices are going to be slow (even on the same node). There isn't anything you can do about this, but it will affect scaling.
Now, onto your specific issue. These are the relevant messages:
```
Estimated memory consumption: 12B/cell
Estimated memory consumption: 376B/particle
NOTE: device 0 can allocate 29445412 particles, while the whole simulation might require 132128272
```
The estimate produced by GPUSPH seems correct, in the sense that if you multiply the allocated particles and cells by their estimated memory consumption, you get a total that is well within the amount of free memory on the GPU, with some to spare. This kind of allocation is done with a 16MiB safety margin, but apparently that isn't enough to safeguard the calculation in this case.
There is a small issue in that we round the number of particles UP rather than DOWN, which can be fixed with this patch:
```diff
diff --git a/src/GPUWorker.cc b/src/GPUWorker.cc
index f756c2dfa..92bcfb6f2 100644
--- a/src/GPUWorker.cc
+++ b/src/GPUWorker.cc
@@ -250,7 +250,10 @@ void GPUWorker::computeAndSetAllocableParticles()
 	freeMemory -= cellsMem + safetyMargin;
 	// keep num allocable particles rounded to the next multiple of 4, to improve reductions' performances
-	uint numAllocableParticles = round_up<uint>(freeMemory / memPerParticle, 4);
+	// NOTE that we should round DOWN here, rather than up
+	const uint allocableParticlesRounding = 4;
+	uint numAllocableParticles = freeMemory / memPerParticle;
+	numAllocableParticles = (numAllocableParticles / allocableParticlesRounding)*allocableParticlesRounding;
 	if (numAllocableParticles < gdata->allocatedParticles)
 		printf("NOTE: device %u can allocate %u particles, while the whole simulation might require %u\n",
```
Can you change `src/GPUWorker.cc` (around line 255) as indicated in the patch and see if that's sufficient? If not, can you increase `allocableParticlesRounding` to see at which point it finally allows the simulation to run? I would suggest trying 16, then 256 if 16 is not enough, and then 1024 if 256 is not enough.