I mentioned before that I would write something on running on GPUs. Let’s imagine we want to simulate a solvated lipid bilayer containing 6,000 lipids for 5 µs. The total number of is around 137,000 and the box dimensions are roughly 42x42x11 nm. Although this is smaller than the benchmark we looked at last time, it is still a challenge to run on a workstation. To see this let’s consider running it on my MacPro using GROMACS 4.6.1. The machine is an and has 2 Intel Xeons, each with 4 cores. Using 8 MPI processes gets me 132 ns/day, so I would have to wait 38 days for 5 µs. Too slow!
You have to be careful installing non-Apple supported NVidia GPUs into MacPros, not least because you are limited to 2x6pin power connectors. Taking all this into account, the best I can do without doing something drastic to the power supply is to install an . Since I only have one GPU, I can only run one MPI process, but this can spawn multiple OpenMP threads. For this particular system, I get the best performance with 4 threads (134 ns/day) which is the same performance I get using all 8 cores without the GPU. So when I am using just a single core, adding in the GPU increases the performance by a factor of 3.3x. But as I add additional cores, the increase afforded by the single GPU drops until the performance is about the same at 8 cores.
Now let’s try something bigger. Our lab has a small Intel (Sandy Bridge) computing cluster. Each node has 12 cores, and there are 8 nodes, yielding a maximum of 96 cores. Running on the whole cluster would reduce the time down to 6 days, which is a lot better but not very fair on everyone else in the lab. We could try and get access to Tier-1 or Tier-0 supercomputers but, for this system, that is overkill. Instead let’s look at a Tier-2 machine that uses GPUs to accelerate the calculations.
The University of Oxford, through , has access to the machines owned by the Centre for Innovation. One of these, , is a GPU-based cluster. We shall look at one of the partitions; this has 60 nodes, each with two 6-core Intel processors and 3 NVIDIA M2090 Tesla GPUs. For comparison, let’s run without the GPUs. The data shown are for simulations with only 1 OpenMP thread per MPI process. So now let’s run using the GPUs (which is the point of this cluster!). Again just using asingle OpenMP thread per MPI process/GPU (shown on graph) we again find a performance increase of 3-4x. Since there are 3 GPUs per node, and each node has 12 cores, we could run 3 MPI processes (each attached to a GPU) on each node and each process could spawn 1, 2, 3 or 4 OpenMP threads. This uses more cores, but since they probably would be sitting idle, this is a more efficient use of the compute resource. Trying 2 or 3 OpenMP threads per MPI process/GPU lets us reach a maximum performance of 1.77 µs per day, so we would get our 5 µs in less than 3 days. Comparing back to our cluster, we can get the same performance of our 96-core local cluster using a total of 9 GPUs and 18 cores on EMERALD.
Finally, let’s compare EMERALD to the Tier-1 PRACE supercomputer CURIE. CURIE was the . For this comparison we will need to use a bigger benchmark, so let’s us . It has 9x the number of lipids, but because I had to add extra water ends up being about 15x bigger at 2.1 million particles. Using 24 GPUs and 72 cores, EMERALD manages 130 ns/day. To get the same performance on CURIE requires 150 cores and ultimately CURIE tops out at 1,500 ns/day on 4,196 cores. Still, EMERALD is respectable and shows how it can serve as a useful bridge to Tier-1 and Tier-0 supercomputers. Interestingly, CURIE also has a “hybrid�? partition that contains 144 nodes, each with 2 Intel Westmere processors and 2 NVIDIA M2090 Tesler GPUs. I was able to run on up to 128 GPUs with 4 OpenMP threads per MPI/GPU, making a total of 512 cores. This demonstrates that GROMACS can run on large numbers of GPU/CPUs and that such hybrid architectures are viable as supercomputers (for GROMACS at least).