[molpro-user] How good is GPU-boost in Molpro, in particular for CCSD, EOM-CCSD calculations

Mon Jun 10 18:24:18 BST 2013

At least according to
http://www.nvidia.com/docs/IO/123576/nv-applications-catalog-lowres.pdf,
Molpro has GPU support for "Density-fitted MP2 (DF-MP2), density
fitted local correlation methods (DF-RHF, DF-KS), DFT", which does not
include (EOM-)CCSD.  A quick examination of the source indicates that
this information is still accurate.

An abstract answer to this question is that one can speed up an
MO-driven CCSD with GPUs by a factor of A, where A is the ratio of
DGEMM performance on the GPU(s) to that on the CPU(s).  However, to
achieve A, one has to reduce and hide data motion as much as possible.
I found that a naive implementation sees a 2x improvement with an
NVIDIA Fermi relative to a dual Xeon X5500 series, while the optimized
implementation sees about a 5x speedup (the optimized implementation
uses the CPU and the GPU at the same time and is compared to CPU-only
execution).  In these comparisons, the GPU has a DP peak of ~500 GF
and the dual-socket CPU node has a DP peak of ~100 GF.  See the papers
linked on  https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond#GPUs
for details.
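
To make the arithmetic behind A concrete, here is a rough sketch in
Python.  It only plugs in the approximate peak numbers quoted above;
the function names and the "overhead" parameter are illustrative
assumptions of mine, not anything measured or taken from the papers.

```python
# Back-of-the-envelope DGEMM-bound speedup estimates.
# All names here are illustrative; the peak rates are the rough
# figures quoted in the text (~100 GF CPU, ~500 GF GPU, both DP).

def gpu_only_speedup(cpu_gf, gpu_gf):
    """Upper bound A when all DGEMM work moves to the GPU."""
    return gpu_gf / cpu_gf

def hybrid_speedup(cpu_gf, gpu_gf):
    """Upper bound when CPU and GPU share the DGEMM work perfectly."""
    return (cpu_gf + gpu_gf) / cpu_gf

def speedup_with_overhead(cpu_gf, gpu_gf, overhead):
    """GPU-only estimate when unhidden data motion adds `overhead`
    times the GPU compute time (an assumed, simplistic model)."""
    return (gpu_gf / cpu_gf) / (1.0 + overhead)

if __name__ == "__main__":
    print(gpu_only_speedup(100.0, 500.0))
    print(hybrid_speedup(100.0, 500.0))
    print(speedup_with_overhead(100.0, 500.0, 1.5))
```

With the quoted peaks this gives bounds of 5x (GPU only) and 6x (CPU
plus GPU).  An overhead of 1.5 is chosen here only because it
reproduces a 2x result like the naive case; it is not a measurement.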

On the other hand, if one runs AO-driven CCSD where the integrals are
computed on the GPU at ~30x the rate of the CPU (claims of higher
speedups on the GPU result from flawed performance analysis and
apples-to-oranges comparisons), I think it should be possible to see
speedups of ~10-30x relative to a CPU-only implementation.  However,
since I've not implemented such a code, this is only speculation.
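
One way to see where a range like ~10-30x could come from is a simple
Amdahl-style estimate.  The sketch below is purely illustrative: the
fraction of run time spent in integrals is a hypothetical input, and
the 5x figure for the remaining work is borrowed from the MO-driven
case above as an assumption.

```python
# Amdahl-style estimate for an AO-driven code (illustrative only).
# integral_fraction is a hypothetical share of CPU-only run time spent
# computing integrals; it is not taken from any real calculation.

def overall_speedup(integral_fraction, integral_speedup, other_speedup):
    """Overall speedup when the integral part and the remaining part
    of the run time are accelerated by different factors."""
    accelerated_time = (integral_fraction / integral_speedup
                        + (1.0 - integral_fraction) / other_speedup)
    return 1.0 / accelerated_time

if __name__ == "__main__":
    for fraction in (0.6, 0.8, 1.0):
        print(fraction, overall_speedup(fraction, 30.0, 5.0))
```

Under these assumptions, a code that spends 60% of its time in
integrals would see about 10x, and the 30x limit is approached only
when integrals dominate the run time almost completely.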

A pre-Fermi NVIDIA Tesla (i.e., the Tesla architecture, as opposed to
the Tesla product line) will lead to almost no improvement for DP.
Only NVIDIA Fermi has useful support for DP.  However, mixed-precision
algorithms can be designed to give SP speedup with DP accuracy for
iterative procedures like CCSD, assuming that they are numerically
stable in SP (the details are given in the aforementioned papers).  Hence,
the mixed-precision algorithm will run at perhaps ~10x relative to the
CPU in DP, but still only ~5x relative to SP on CPU, which is the
appropriate apples-to-apples performance comparison.
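
As a toy illustration of the mixed-precision idea (not the algorithm
from the papers, and not CCSD), the Python sketch below solves a small
fixed-point problem x = b + M x by iterating in emulated SP first and
then finishing in DP.  The matrix, tolerances, and function names are
all made up for the example; SP is emulated by rounding through the
struct module.

```python
import struct

def to_sp(value):
    """Round a Python float (DP) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]

def step(x, M, b, sp=False):
    """One update x <- b + M x, optionally rounding each operation to SP."""
    rnd = to_sp if sp else (lambda v: v)
    new = []
    for i, row in enumerate(M):
        acc = rnd(b[i])
        for m_ij, x_j in zip(row, x):
            acc = rnd(acc + rnd(rnd(m_ij) * x_j))
        new.append(acc)
    return new

def iterate(x, M, b, tol, sp, max_iter=200):
    """Iterate until the largest change falls below tol; return (x, count)."""
    for count in range(1, max_iter + 1):
        new = step(x, M, b, sp=sp)
        change = max(abs(n - o) for n, o in zip(new, x))
        x = new
        if change < tol:
            break
    return x, count

def solve(M, b, use_sp=True, sp_tol=1e-6, dp_tol=1e-13):
    """Optionally converge in SP first, then polish in DP."""
    x = [0.0] * len(b)
    counts = {"sp": 0, "dp": 0}
    if use_sp:
        x, counts["sp"] = iterate(x, M, b, sp_tol, sp=True)
    x, counts["dp"] = iterate(x, M, b, dp_tol, sp=False)
    return x, counts

if __name__ == "__main__":
    M = [[0.1, 0.2], [0.3, 0.1]]   # toy contraction, made up
    b = [1.0, 2.0]
    print(solve(M, b, use_sp=True))
    print(solve(M, b, use_sp=False))
```

Both runs should agree to DP accuracy, but the mixed run needs fewer
of the expensive DP iterations because it starts them from the SP
answer.  This works only because the iteration is stable in SP, which
is the same caveat as above.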

Note that all previous speedups are relative to the CPUs when using
all the cores with threads and vectorized kernels, which is the only
proper way to compare a CPU to a GPU.  Comparing a GPU to a single CPU
core is nonsensical (comparing a single NVIDIA GPU SM to a single CPU
core is valid but no one bothers to do this).  <end polemic>

Acronyms used:
GF = gigaflop/s
DP = double precision
SP = single precision
SM = streaming multiprocessor, composed of many "CUDA cores" (e.g. 32 per SM on Fermi GF100)

Best,

Jeff

On Mon, Jun 10, 2013 at 9:47 AM, Evgeniy Gromov
<Evgeniy.Gromov at pci.uni-heidelberg.de> wrote:
> Dear Developers and Users of Molpro,
>
> I wonder if someone has tried/tested the performance of
> Molpro boosted by GPUs (Tesla). I am interested in particular
> in speed up for CCSD and EOM-CCSD calculations.
>
> Best regards,
> Evgeniy
> --
> _______________________________________
> Dr. Evgeniy Gromov
> Theoretische Chemie
> Physikalisch-Chemisches Institut
> Im Neuenheimer Feld 229
> D-69120 Heidelberg
> Germany
>
> Telefon: +49/(0)6221/545263
> Fax: +49/(0)6221/545221
> E-mail: evgeniy at pci.uni-heidelberg.de
> _______________________________________

-- 
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhammond at alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
ALCF docs: http://www.alcf.anl.gov/user-guides