Running on multiple nodes of GPUs

I am attempting to run across multiple nodes of GPUs. MPI is enabled and I am using Open-MPI 4.1.4. When I run the batch file attached below (2 nodes), I get an error which I believe is due to a lack of memory on the GPU, even though the memory should be more than adequate (I have no issue running the same number of particles on a single node, i.e. half the number of GPUs).

If I reduce the number of particles to a very small number, the file will run, although the multi-node speed is much slower than that of a single node.

The batch file I am using is shown here. I am guessing I have some misunderstandings about how I should format the mpirun command.

Appreciate any advice you have,
Ethan

If I’m reading this correctly, it’s trying to allocate a neighbors list with 3757590016 elements (that’s 3.7 billion), so it seems to believe it can go for ~30M particles per GPU, but is failing to do so. I’ll need some additional information for this, including what GPUs you’re trying to run the system on, how much memory they have, and how many particles in total the simulation has. A full log would also be useful, as it will contain information about how many particles GPUSPH thinks it can load on each device and why (your mpirun probably supports a command-line -output-filename option that should allow you to redirect all output to a file).

Yeah, 3.7 billion particles is not good. The number of particles should be 128 million (using a pph of 400 for 2 cubic meters of domain). This is running on 2080 Ti GPUs, which have 11 GB (or 10.75 GiB) of memory each. 128 million particles should be achievable on 5 of these GPUs; here I am trying to use 16 (2 nodes of 8). Here is the full output log. Let me know if there are issues accessing it, I wasn’t sure how to attach it.

Output file for 2 node StillWater Case

The output seems to show 8 million particles per device, which is what I wanted. I am confused about where the 3.7 billion particle neighbor list came from.

Thanks so much!

3.7 billion elements for the neighbors list actually corresponds to ~30M particles with the standard neighbors list size (128 neibs per particle), so that’s reasonable. (Yes, the neighbors list is the largest array in GPUSPH.) Probably it’s running out of memory due to some silly allocation alignment requirement, or something like that, which leads to the last array just barely not fitting.
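
For the record, the arithmetic behind that figure is just the element count divided by the default neighbors-per-particle; a minimal standalone check (not GPUSPH code):

#include <cstdio>
#include <cstdint>

int main()
{
    // figures from the error message and the default neighbors list size
    const uint64_t neibsListElements = 3757590016ULL; // elements in the failing allocation
    const uint64_t maxneibs = 128;                     // default neighbors per particle
    const uint64_t particles = neibsListElements / maxneibs;
    printf("implied particles per device: %llu (~%.1fM)\n",
        (unsigned long long)particles, particles / 1e6);
    // prints 29356172 (~29.4M), i.e. the ~30M particles per GPU mentioned above
    return 0;
}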

Unrelated to your specific issue, but I see from the logs that you built for Compute Capability 5.0, while the GPUs you’re running the simulation on have Compute Capability 7.5. You should recompile GPUSPH for the correct compute capability (make compute=75). Also, it seems the GPUs cannot peer, so data transfers between devices are going to be slow (even on the same node). This isn’t something you can do anything about, but it will affect scaling.

Now, onto your specific issue. These are the relevant messages:

Estimated memory consumption: 12B/cell
Estimated memory consumption: 376B/particle
NOTE: device 0 can allocate 29445412 particles, while the whole simulation might require 132128272

The estimation done by GPUSPH seems correct, in the sense that if you multiply the allocated particles and cells by their estimated memory consumption, you get a total that fits within the amount of free memory on the GPU, with some room to spare. These kinds of allocations are done with a 16MiB safety margin, but apparently this isn’t enough to safeguard the calculation in this case.
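
To make the check explicit with the figures quoted above (a standalone sketch; the cell count and the exact free-memory figure come from your full log, so the cell term is left out here):

#include <cstdio>
#include <cstdint>

int main()
{
    // figures from the log excerpt above
    const uint64_t allocatedParticles = 29445412;     // "device 0 can allocate ..."
    const uint64_t bytesPerParticle   = 376;          // "376B/particle"
    const uint64_t safetyMargin       = 16ULL << 20;  // 16 MiB

    const uint64_t particleMem = allocatedParticles * bytesPerParticle;
    printf("particle arrays: %.2f GiB, plus cells and a %llu MiB margin\n",
        particleMem / double(1ULL << 30),
        (unsigned long long)(safetyMargin >> 20));
    // -> roughly 10.3 GiB just for the particle arrays on an 11 GB card,
    //    so the fit is tight and a small accounting error is enough to fail
    return 0;
}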

There is a small issue in that we’re rounding the number of particles UP rather than DOWN, which can be solved with this patch:

diff --git a/src/GPUWorker.cc b/src/GPUWorker.cc
index f756c2dfa..92bcfb6f2 100644
--- a/src/GPUWorker.cc
+++ b/src/GPUWorker.cc
@@ -250,7 +250,10 @@ void GPUWorker::computeAndSetAllocableParticles()
 	freeMemory -= cellsMem + safetyMargin;
 
 	// keep num allocable particles rounded to the next multiple of 4, to improve reductions' performances
-	uint numAllocableParticles = round_up<uint>(freeMemory / memPerParticle, 4);
+	// NOTE that we should round DOWN here, rather than up
+	const uint allocableParticlesRounding = 4;
+	uint numAllocableParticles = freeMemory / memPerParticle;
+	numAllocableParticles = (numAllocableParticles / allocableParticlesRounding)*allocableParticlesRounding;
 
 	if (numAllocableParticles < gdata->allocatedParticles)
 		printf("NOTE: device %u can allocate %u particles, while the whole simulation might require %u\n",

Can you change src/GPUWorker.cc (around line 255) as indicated in the patch and see if it’s sufficient? If not, can you increase allocableParticlesRounding to see at which point it finally allows running the simulation? I would ask you to try 16, and then 256 if 16 is not enough, and then 1024 if 256 is not enough.

Thanks for the help!
I applied the patch and there doesn’t seem to be any change in the error. I set allocableParticlesRounding up to 65536 (4^8) and the 3.7 billion element number in the error is still the same. Should I go higher?

Cheers,
Ethan

Well, there’s definitely something weird going on here. Apparently the allocation is failing for more than just some alignment issue, and we’ll need something more drastic.

I think at this point you can bring allocableParticlesRounding back down to 4, and we can try two other tunables.

One is safetyMargin, around src/GPUWorker.cc line 220, which is currently set to 16MiB. You can try raising it (just change the 16 to 32, 64, etc., up to 1024, which would be 1GiB).
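
The kind of edit I mean is simply along these lines (a sketch; the actual declaration around line 220 may be written differently in the source):

// sketch: raise the allocation safety margin, e.g. from 16 MiB to 256 MiB
const size_t safetyMargin = 256UL << 20;   // was: 16 MiB

// used later as in the patch context above:
// freeMemory -= cellsMem + safetyMargin;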

The other is the estimated bytes per particle (still in src/GPUWorker.cc, function computeMemoryPerParticle, around line 179): here the bytes per particle are rounded up to the next multiple of 4. You could try upping that to 256 (which should increase the memory per particle to 512 in your case, and decrease the number of particles per device to ~20M, if my estimates are correct).
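
Concretely, something along these lines (again a sketch; the exact expression in computeMemoryPerParticle may differ, the rounding multiple is the part to change):

#include <cstddef>

// sketch: make the per-particle estimate more conservative by rounding it
// up to a coarser multiple
size_t round_up_to(size_t bytes, size_t multiple)
{
    return ((bytes + multiple - 1) / multiple) * multiple;
}

// round_up_to(376, 4)   -> 376 B/particle, ~29.4M particles on an 11 GB card
// round_up_to(376, 256) -> 512 B/particle, roughly 20M particles, as estimated above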

Changing computeMemoryPerParticle rounding to 1024 did allow it to run! The speed is much lower than expected but that may be an entirely separate issue.

Is there a way to estimate the space required for the neighbors list? With this scaling, adding additional nodes does not increase the maximum number of particles for a problem.

Thanks for the help!
Ethan

Interesting. How many particles are being allocated per device then?

The neighbors list uses a 16-bit integer per neighbor, so the size (in bytes) per particle is 2*maxneibs. This has to be multiplied by the total number of particles allocated on the device to get the total memory consumption. Once a simulation is too big to fit on a single device, it doesn’t matter how many devices are going to be used: GPUSPH will only allocate as many particles per device as it can (modulo issues with the automatic computation such as the ones you’ve experienced), which depends on the amount of device memory, so the neighbors list size won’t change.
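
As a worked example with the figures from your log (a standalone sketch, not GPUSPH code; 128 neibs per particle is the default mentioned earlier):

#include <cstdio>
#include <cstdint>

int main()
{
    const uint64_t maxneibs     = 128;       // default neighbors list size
    const uint64_t bytesPerNeib = 2;         // 16-bit index per neighbor
    const uint64_t devParticles = 29445412;  // particles allocatable on one device (from your log)

    const uint64_t elements = devParticles * maxneibs;
    const uint64_t bytes    = elements * bytesPerNeib;
    printf("neighbors list: %llu elements, %.2f GiB\n",
        (unsigned long long)elements, bytes / double(1ULL << 30));
    // -> about 3.77 billion elements, ~7.0 GiB: the same ballpark as the
    //    3757590016-element allocation in the original error
    return 0;
}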

I’m still perplexed as to why the automatic estimation by GPUSPH was failing. I have pushed some changes to the development branches, and if you have the possibility to run your test case on the new code I pushed, with the --debug buffer-alloc command-line option, I would appreciate it. Also, with the changes I’ve made it’s now possible to manually set the number of particles per device using the --max-device-particles command-line option, which should be easier than hacking around in the source code to correct the GPUSPH estimates.

Performance-wise, as I mentioned, your code was being compiled for the wrong architecture (something that you probably already solved when you recompiled, but do check the logs to make sure you’re running with the correct compute capability set), but there’s also an issue due to the GPUs not being able to peer even on the same node. This will make all device-to-device transfers less efficient. It becomes extremely important in this case that data transfers are minimized, and that as many particles as possible are “internal” to each device.

Make sure that the number of bursts is minimal (ideally, 1 burst per neighbor device); look for the “data transfers compacted in” line, which will tell you how many local and network bursts are needed. The choice of linearization and split direction becomes essential here.