[molpro-user] How good is GPU-boost in Molpro, in particular for CCSD, EOM-CCSD calculations

Mon Jun 10 18:35:09 BST 2013

Jeff,

Thanks a lot for your very comprehensive reply.

Best regards,
Evgeniy

Jeff Hammond wrote:
> At least according to
> http://www.nvidia.com/docs/IO/123576/nv-applications-catalog-lowres.pdf,
> Molpro has GPU support for "Density-fitted MP2 (DF-MP2), density
> fitted local correlation methods (DF-RHF, DF-KS), DFT", which does not
> include (EOM-)CCSD.  A quick examination of the source indicates that
> this information is still accurate.
>
> An abstract answer to this question is that one can speed up an
> MO-driven CCSD with GPUs by a factor of A, where A is the relative
> performance of DGEMM on the CPU(s) and GPU(s).  However, to achieve A,
> one has to reduce and hide data motion as much as possible.  I found
> that a naive implementation sees 2x improvement with an NVIDIA Fermi
> relative to a dual Xeon X5500 series while the optimized
> implementation sees about 5x speedup (the optimized implementation
> uses the CPU and the GPU at the same time and is compared to CPU-only
> execution).  In these comparisons, the GPU peaks at ~500 GF in DP and
> the dual-socket CPU node peaks at ~100 GF in DP.  See the papers
> linked on https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond#GPUs
> for details.
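
The arithmetic behind these figures can be sketched as follows. This is a back-of-the-envelope model, not anything taken from Molpro or from the papers linked above: the function name is hypothetical, and it assumes the run is entirely DGEMM-bound with no data-motion cost, so it gives a ceiling rather than a prediction.

```python
def dgemm_bound_speedup(cpu_gf, gpu_gf, use_cpu_too=False):
    """Ideal speedup over CPU-only execution for a purely DGEMM-bound code.

    cpu_gf, gpu_gf: DP DGEMM rates in gigaflop/s (assumed, not measured).
    use_cpu_too:    True if the CPU keeps working alongside the GPU.
    """
    if cpu_gf <= 0 or gpu_gf <= 0:
        raise ValueError("rates must be positive")
    accelerated_rate = gpu_gf + (cpu_gf if use_cpu_too else 0.0)
    return accelerated_rate / cpu_gf


# Peak figures quoted in the message: ~100 GF (dual Xeon), ~500 GF (Fermi).
CPU_PEAK_GF = 100.0
GPU_PEAK_GF = 500.0

gpu_only = dgemm_bound_speedup(CPU_PEAK_GF, GPU_PEAK_GF)
hybrid = dgemm_bound_speedup(CPU_PEAK_GF, GPU_PEAK_GF, use_cpu_too=True)
print(f"GPU only: {gpu_only:.1f}x, CPU+GPU: {hybrid:.1f}x")
```

With the quoted peaks the ceiling is 5x for the GPU alone and 6x for CPU and GPU together, so the ~5x reported for the optimized implementation sits close to the hardware limit, and the gap to the naive 2x is the cost of unhidden data motion.
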
>
> On the other hand, if one runs AO-driven CCSD where the integrals are
> computed on the GPU at ~30x the rate of the CPU (claims of higher
> speedups on the GPU result from flawed performance analysis and
> apples-to-oranges comparisons), I think it should be possible to see
> speedups of ~10-30x relative to a CPU-only implementation.  However,
> since I've not implemented such a code, this is only speculation.
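
One way to see how a ~30x integral speedup could translate into ~10-30x overall is Amdahl's law. The sketch below is only an illustration of that reasoning: the function name is hypothetical, the integral fractions are made-up inputs rather than measurements of any real AO-driven CCSD code, and everything outside integral evaluation is assumed to run at unchanged speed.

```python
def overall_speedup(accelerated_fraction, kernel_speedup):
    """Amdahl's law: whole-program speedup when only part of the runtime
    (accelerated_fraction, between 0 and 1) is sped up by kernel_speedup."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must lie in [0, 1]")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    remaining = (1.0 - accelerated_fraction) + accelerated_fraction / kernel_speedup
    return 1.0 / remaining


INTEGRAL_SPEEDUP = 30.0  # the ~30x GPU integral rate quoted in the message

# Hypothetical fractions of CPU-only runtime spent computing integrals.
for fraction in (0.90, 0.95, 0.99, 1.00):
    print(f"{fraction:.2f} of runtime in integrals -> "
          f"{overall_speedup(fraction, INTEGRAL_SPEEDUP):.1f}x overall")
```

Under this model the lower end of the 10-30x range already requires integral evaluation to make up roughly 93% of the CPU-only runtime, and 30x is reached only if nothing else costs any time at all.
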
>
> A pre-Fermi NVIDIA Tesla will lead to almost no improvement for DP.
> Only NVIDIA Fermi has useful support for DP.  However, mixed-precision
> algorithms can be designed to give SP speedup with DP accuracy for
> iterative procedures like CCSD, assuming that they are numerically
> stable in SP (the details are given in the aforementioned papers).  Hence,
> the mixed-precision algorithm will run at perhaps ~10x relative to the
> CPU in DP, but still only ~5x relative to SP on CPU, which is the
> appropriate apples-to-apples performance comparison.
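
The mixed-precision idea above can be illustrated on a toy problem: iterate in SP until the updates reach SP-level noise, then finish with a few DP iterations. The sketch below is an assumption-laden stand-in, not the CCSD algorithm from the papers. It uses Jacobi iteration on a small linear system, all function names are hypothetical, and SP is emulated by rounding through `struct`, so it shows the numerics only and says nothing about speed.

```python
import struct


def to_sp(x):
    """Round a Python float (DP) to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def jacobi_sweep(a, b, x, rounder=float):
    """One Jacobi update; `rounder` is applied after every arithmetic step."""
    n = len(b)
    new_x = []
    for i in range(n):
        s = rounder(b[i])
        for j in range(n):
            if j != i:
                s = rounder(s - rounder(a[i][j] * x[j]))
        new_x.append(rounder(s / a[i][i]))
    return new_x


def iterate(a, b, x, tol, rounder, max_iter=200):
    """Sweep until the largest update falls below tol; return (x, sweeps)."""
    for sweeps in range(1, max_iter + 1):
        new_x = jacobi_sweep(a, b, x, rounder)
        change = max(abs(p - q) for p, q in zip(new_x, x))
        x = new_x
        if change < tol:
            break
    return x, sweeps


def solve_mixed(a, b, sp_tol=1e-5, dp_tol=1e-13):
    """SP sweeps to a loose tolerance, then DP sweeps to a tight one."""
    x0 = [0.0] * len(b)
    x_sp, sp_sweeps = iterate(a, b, x0, sp_tol, to_sp)
    x_dp, dp_sweeps = iterate(a, b, x_sp, dp_tol, float)
    return x_dp, sp_sweeps, dp_sweeps


# Diagonally dominant test system whose exact solution is [1, 1, 1].
A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
B = [5.0, 6.0, 5.0]

solution, sp_sweeps, dp_sweeps = solve_mixed(A, B)
print(solution, sp_sweeps, dp_sweeps)
```

The DP phase starts from an already converged SP guess, so only part of the total work has to be done at DP cost. Whether that carries over to a given CCSD implementation depends on the iteration being numerically stable in SP, as noted above.
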
>
> Note that all previous speedups are relative to the CPUs when using
> all the cores with threads and vectorized kernels, which is the only
> proper way to compare a CPU to a GPU.  Comparing a GPU to a single CPU
> core is nonsensical (comparing a single NVIDIA GPU SM to a single CPU
> core is valid but no one bothers to do this).  <end polemic>
>
> Acronyms used:
> GF = gigaflop/s
> DP = double precision
> SP = single precision
> SM = streaming multiprocessor composed of 15+ "CUDA cores"
>
> Best,
>
> Jeff
>
> On Mon, Jun 10, 2013 at 9:47 AM, Evgeniy Gromov
> <Evgeniy.Gromov at pci.uni-heidelberg.de>  wrote:
>> Dear Developers and Users of Molpro,
>>
>> I wonder if someone has tried/tested the performance of
>> Molpro boosted by GPUs (Tesla). I am interested in particular
>> in speed up for CCSD and EOM-CCSD calculations.
>>
>> Best regards,
>> Evgeniy
>> --
>> _______________________________________
>> Dr. Evgeniy Gromov
>> Theoretische Chemie
>> Physikalisch-Chemisches Institut
>> Im Neuenheimer Feld 229
>> D-69120 Heidelberg
>> Germany
>>
>> Telefon: +49/(0)6221/545263
>> Fax: +49/(0)6221/545221
>> E-mail: evgeniy at pci.uni-heidelberg.de
>> _______________________________________